# Vectors {#vectors-chap}
```{r setup, include = FALSE}
source("common.R")
```
## Introduction
\index{vectors}
\index{nodes}
This chapter discusses the most important family of data types in base R: vectors[^node]. While you've probably already used many (if not all) of the different types of vectors, you may not have thought deeply about how they're interrelated. In this chapter, I won't cover individual vector types in too much detail, but I will show you how all the types fit together as a whole. If you need more details, you can find them in R's documentation.
[^node]: Collectively, all the other data types are known as "node" types, which include things like functions and environments. You're most likely to come across this highly technical term when using `gc()`: the "N" in `Ncells` stands for nodes and the "V" in `Vcells` stands for vectors.
Vectors come in two flavours: atomic vectors and lists[^generic-vectors]. They differ in terms of their elements' types: for atomic vectors, all elements must have the same type; for lists, elements can have different types. While not a vector, `NULL` is closely related to vectors and often serves the role of a generic zero length vector. This diagram, which we'll be expanding on throughout this chapter, illustrates the basic relationships:
```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("diagrams/vectors/summary-tree.png")
```
[^generic-vectors]: A few places in R's documentation call lists generic vectors to emphasise their difference from atomic vectors.
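
A minimal sketch of how `NULL` plays the role of a generic zero-length vector, using only base R. It shows the type and length of `NULL` and what happens when it is combined with another vector:

```{r}
typeof(NULL)
length(NULL)
# NULL contributes nothing when combined with other vectors
c(NULL, 1:3)
# Calling c() with no arguments returns NULL
is.null(c())
```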
Every vector can also have __attributes__, which you can think of as a named list of arbitrary metadata. Two attributes are particularly important. The **dim**ension attribute turns vectors into matrices and arrays and the __class__ attribute powers the S3 object system. While you'll learn how to use S3 in Chapter \@ref(s3), here you'll learn about some of the most important S3 vectors: factors, date/times, data frames, and tibbles. And while 2D structures like matrices and data frames are not necessarily what come to mind when you think of vectors, you'll also learn why R considers them to be vectors.
### Quiz {-}
Take this short quiz to determine if you need to read this chapter. If the answers quickly come to mind, you can comfortably skip this chapter. You can check your answers in Section \@ref(data-structure-answers).
1. What are the four common types of atomic vectors? What are the two
rare types?
1. What are attributes? How do you get them and set them?
1. How is a list different from an atomic vector? How is a matrix different
from a data frame?
1. Can you have a list that is a matrix? Can a data frame have a column
that is a matrix?
1. How do tibbles behave differently from data frames?
### Outline {-}
* Section \@ref(atomic-vectors) introduces you to the atomic vectors:
logical, integer, double, and character. These are R's simplest data
structures.
* Section \@ref(attributes) takes a small detour to discuss attributes,
R's flexible metadata specification. The most important attributes are
names, dimensions, and class.
* Section \@ref(s3-atomic-vectors) discusses the important vector types that
are built by combining atomic vectors with special attributes. These include
factors, dates, date-times, and durations.
* Section \@ref(lists) dives into lists. Lists are very similar to atomic
vectors, but have one key difference: an element of a list can be any
data type, including another list. This makes them suitable for representing
hierarchical data.
* Section \@ref(tibble) teaches you about data frames and tibbles, which
are used to represent rectangular data. They combine the behaviour
of lists and matrices to make a structure ideally suited for the needs
of statistical data.
## Atomic vectors
\index{atomic vectors}
\index{vectors!atomic|see {atomic vectors}}
\index{logical vectors}
\index{integer vectors}
\index{double vectors}
\index{numeric vectors}
\index{character vectors}
There are four primary types of atomic vectors: logical, integer, double, and character (which contains strings). Collectively integer and double vectors are known as numeric vectors[^numeric]. There are two rare types: complex and raw. I won't discuss them further because complex numbers are rarely needed in statistics, and raw vectors are a special type that's only needed when handling binary data.
```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("diagrams/vectors/summary-tree-atomic.png")
```
[^numeric]: This is a slight simplification as R does not use "numeric" consistently, which we'll come back to in Section \@ref(numeric-type).
### Scalars
\index{scalars}
\indexc{NA}
\index{missing values|see {\texttt{NA}}}
\indexc{NaN}
\indexc{Inf}
\indexc{L}
\indexc{""}
\indexc{'}
Each of the four primary types has a special syntax to create an individual value, AKA a __scalar__[^scalar], and its own missing value.
* Strings are surrounded by `"` (`"hi"`) or `'` (`'bye'`). Special characters
are escaped with `\`; see `?Quotes` for full details. The missing value
for strings is `NA_character_`.
* Doubles can be specified in decimal (`0.1234`), scientific (`1.23e4`), or
hexadecimal (`0xcafe`) form. There are three special values unique to
doubles: `Inf`, `-Inf`, and `NaN` (not a number). These are special values
defined by the floating point standard. The missing value for doubles is
`NA_real_`.
* Integers are written similarly to doubles but must be followed by `L`[^L-suffix]
(`1234L`, `1e4L`, or `0xcafeL`), and cannot include decimals. The integer
missing value is `NA_integer_`.
* Logicals can be spelt out (`TRUE` or `FALSE`), or abbreviated (`T` or `F`).
The missing value for logicals is `NA`.
[^L-suffix]: `L` is not intuitive, and you might wonder where it comes from. At the time `L` was added to R, R's integer type was equivalent to a long integer in C, and C code could use a suffix of `l` or `L` to force a number to be a long integer. It was decided that `l` was too visually similar to `i` (used for complex numbers in R), leaving `L`.
[^scalar]: Technically, the R language does not possess scalars. Everything that looks like a scalar is actually a vector of length one. This is mostly a theoretical distinction, but it does mean that expressions like `1[1]` work.
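
The scalar syntax described above can be checked with `typeof()`. This is a small sketch rather than an exhaustive tour; it shows that each missing value carries the type of the vector it belongs to, that the `L` suffix is what separates an integer from a double, and that a "scalar" is really a vector of length one:

```{r}
# Each NA variant has the type of the vector it belongs to
typeof(NA)
typeof(NA_integer_)
typeof(NA_real_)
typeof(NA_character_)
# The L suffix distinguishes an integer from a double
typeof(1)
typeof(1L)
# Decimal, scientific, and hexadecimal forms all produce doubles
c(0.1234, 1.23e4, 0xcafe)
typeof(0xcafe)
# A "scalar" is a length-one vector, so it can be subset
length(1)
1[1]
```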
### Making longer vectors with `c()` {#atomic-constructing}
\indexc{typeof()}
\indexc{length()}
\indexc{c()}
To create longer vectors from shorter ones, use `c()`, short for combine:
```{r}
dbl_var <- c(1, 2.5, 4.5)
int_var <- c(1L, 6L, 10L)
lgl_var <- c(TRUE, FALSE)
chr_var <- c("these are", "some strings")
```
When the inputs are atomic vectors, `c()` always creates another atomic vector; i.e. it flattens:
```{r}
c(c(1, 2), c(3, 4))
```
In diagrams, I'll depict vectors as connected rectangles, so the above code could be drawn as follows:
```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("diagrams/vectors/atomic.png")
```
You can determine the type of a vector with `typeof()`[^mode] and its length with `length()`.
```{r}
typeof(dbl_var)
typeof(int_var)
typeof(lgl_var)
typeof(chr_var)
```
[^mode]: You may have heard of the related `mode()` and `storage.mode()` functions. Do not use them: they exist only for compatibility with S.
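
`length()` works the same way for every vector type. A short sketch, redefining two of the vectors from above so that it runs on its own; note that `length()` counts elements, whereas `nchar()` counts the characters within each string:

```{r}
dbl_var <- c(1, 2.5, 4.5)
chr_var <- c("these are", "some strings")
length(dbl_var)
length(chr_var)
# length() counts elements, not characters
length("some strings")
nchar("some strings")
```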
### Testing and coercion
\index{coercion}
\indexc{is.vector()}
\indexc{is.atomic()}
\indexc{is.numeric()}
Generally, you can __test__ if a vector is of a given type with an `is.*()` function, but they need to be used with care. `is.character()`, `is.double()`, `is.integer()`, and `is.logical()` do what you might expect: they test if a vector is a character, double, integer, or logical. Avoid `is.vector()`, `is.atomic()`, and `is.numeric()`: they don't test if you have a vector, atomic vector, or numeric vector; you'll need to carefully read the docs to figure out what they actually do.
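
To make the warning concrete, here is a small sketch of the type predicates, including two results that often surprise people. The `units` attribute is a hypothetical name chosen for illustration, and the examples are not a complete specification of what these functions test:

```{r}
x <- 1:3
is.integer(x)
is.double(x)
# is.numeric() is TRUE for both integer and double vectors...
is.numeric(x)
is.numeric(c(1.5, 2))
# ...but FALSE for factors, even though they are built on integers
is.numeric(factor("a"))
# is.vector() is FALSE when attributes other than names are present
# (`units` is a made-up attribute name)
y <- structure(1:3, units = "cm")
is.vector(x)
is.vector(y)
```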
For atomic vectors, type is a property of the entire vector: all elements must be the same type. When you attempt to combine different types they will be __coerced__ in a fixed order: character → double → integer → logical. For example, combining a character and an integer yields a character:
```{r}
str(c("a", 1))
```
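
The same ordering applies to every pair of types. As a quick sketch, `typeof()` shows which type wins when adjacent types in the coercion order are combined:

```{r}
typeof(c(FALSE, 2L))   # logical combined with integer
typeof(c(2L, 2.5))     # integer combined with double
typeof(c(2.5, "b"))    # double combined with character
# Coercion applies to the whole vector, so the number becomes a string
c(2.5, "b")
```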
Coercion often happens automatically. Most mathematical functions (`+`, `log`, `abs`, etc.) will coerce to numeric. This coercion is particularly useful for logical vectors because `TRUE` becomes 1 and `FALSE` becomes 0.
```{r}
x <- c(FALSE, FALSE, TRUE)
as.numeric(x)
# Total number of TRUEs
sum(x)
# Proportion that are TRUE
mean(x)
```
Generally, you can deliberately coerce by using an `as.*()` function, like `as.character()`, `as.double()`, `as.integer()`, or `as.logical()`. Failed coercion of strings generates a warning and a missing value:
```{r}
as.integer(c("1", "1.5", "a"))
```
### Exercises
1. How do you create raw and complex scalars? (See `?raw` and
`?complex`)
1. Test your knowledge of the vector coercion rules by predicting the output of
   the following uses of `c()`:

    ```{r, eval=FALSE}
    c(1, FALSE)
    c("a", 1)
    c(TRUE, 1L)
    ```

1. Why is `1 == "1"` true? Why is `-1 < FALSE` true? Why is `"one" < 2` false?
1. Why is the default missing value, `NA`, a logical vector? What's special
about logical vectors? (Hint: think about `c(FALSE, NA_character_)`.)
1. Precisely what do `is.atomic()`, `is.numeric()`, and `is.vector()` test for?
## Attributes {#attributes}
\index{attributes}
You might have noticed that the set of atomic vectors does not include a number of important data structures like matrices and arrays, factors and date/times. These types are built on top of atomic vectors by adding attributes. In this section, you'll learn the basics of attributes, and how the dim attribute makes matrices and arrays. In the next section you'll learn how the class attribute is used to create S3 vectors, including factors, dates, and date-times.
### Getting and setting
\indexc{attr()}
\index{attributes!attributes@\texttt{attributes()}}
\indexc{structure()}
You can think of attributes as name-value pairs[^pairlist] that attach metadata to an object. Individual attributes can be retrieved and modified with `attr()`, or retrieved en masse with `attributes()`, and set en masse with `structure()`.
[^pairlist]: Attributes behave like named lists, but are actually pairlists. Pairlists are functionally indistinguishable from lists, but are profoundly different under the hood. You'll learn more about them in Section \@ref(pairlists).
```{r}
a <- 1:3
attr(a, "x") <- "abcdef"
attr(a, "x")
attr(a, "y") <- 4:6
str(attributes(a))
# Or equivalently
a <- structure(
1:3,
x = "abcdef",
y = 4:6
)
str(attributes(a))
```
```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("diagrams/vectors/attr.png")
```
Attributes should generally be thought of as ephemeral. For example, most attributes are lost by most operations:
```{r}
attributes(a[1])
attributes(sum(a))
```
There are only two attributes that are routinely preserved:
* __names__, a character vector giving each element a name.
* __dim__, short for dimensions, an integer vector, used to turn vectors
into matrices or arrays.
To preserve other attributes, you'll need to create your own S3 class, the topic of Chapter \@ref(s3).
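
A minimal sketch of this behaviour under subsetting, assuming a made-up `units` attribute for illustration. Names and dimensions survive (suitably updated), while the custom attribute is dropped:

```{r}
# `units` is a hypothetical custom attribute
x <- structure(1:3, names = c("a", "b", "c"), units = "cm")
# Subsetting keeps names but drops the custom attribute
attributes(x[1:2])
m <- structure(1:6, dim = c(2, 3), units = "cm")
# Matrix subsetting keeps (updated) dimensions but again drops `units`
attributes(m[, 1:2])
```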
### Names {#attr-names}
\index{attributes!names}
\indexc{names()}
\indexc{setNames()}
You can name a vector in three ways:
```{r}
# When creating it:
x <- c(a = 1, b = 2, c = 3)
# By assigning a character vector to names()
x <- 1:3
names(x) <- c("a", "b", "c")
# Inline, with setNames():
x <- setNames(1:3, c("a", "b", "c"))
```
Avoid using `attr(x, "names")` as it requires more typing and is less readable than `names(x)`. You can remove names from a vector by using `unname(x)` or `names(x) <- NULL`.
To be technically correct, when drawing the named vector `x`, I should draw it like so:
```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("diagrams/vectors/attr-names-1.png")
```
However, names are so special and so important that, unless I'm trying specifically to draw attention to the attributes data structure, I'll use them to label the vector directly:
```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("diagrams/vectors/attr-names-2.png")
```
To be useful with character subsetting (e.g. Section \@ref(lookup-tables)) names should be unique, and non-missing, but this is not enforced by R. Depending on how the names are set, missing names may be either `""` or `NA_character_`. If all names are missing, `names()` will return `NULL`.
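
The three cases can be sketched as follows; which kind of missing name you get depends on how the names were set:

```{r}
# Naming only some elements at creation gives "" for the rest
names(c(a = 1, 2, 3))
# Assigning a too-short names vector pads with NA
x <- 1:3
names(x) <- "a"
names(x)
# With no names at all, names() returns NULL
names(1:3)
```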
### Dimensions {#attr-dims}
\index{arrays}
\index{matrices|see {arrays}}
\index{attributes!dimensions}
Adding a `dim` attribute to a vector allows it to behave like a 2-dimensional __matrix__ or a multi-dimensional __array__. Matrices and arrays are primarily mathematical/statistical tools, not programming tools, so they'll be used infrequently and only covered briefly in this book. Their most important feature is multidimensional subsetting, which is covered in Section \@ref(matrix-subsetting).
You can create matrices and arrays with `matrix()` and `array()`, or by using the assignment form of `dim()`:
```{r}
# Two scalar arguments specify row and column sizes
a <- matrix(1:6, nrow = 2, ncol = 3)
a
# One vector argument to describe all dimensions
b <- array(1:12, c(2, 3, 2))
b
# You can also modify an object in place by setting dim()
c <- 1:6
dim(c) <- c(3, 2)
c
```
Many of the functions for working with vectors have generalisations for matrices and arrays:
| Vector | Matrix | Array |
|-------------------|----------------------------|------------------|
| `names()` | `rownames()`, `colnames()` | `dimnames()` |
| `length()` | `nrow()`, `ncol()` | `dim()` |
| `c()` | `rbind()`, `cbind()` | `abind::abind()` |
| --- | `t()` | `aperm()` |
| `is.null(dim(x))` | `is.matrix()` | `is.array()` |
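
A brief sketch of a few of the matrix functions from the table, applied to a small matrix. The row and column labels are arbitrary choices for illustration:

```{r}
m <- matrix(1:6, nrow = 2, ncol = 3)
rownames(m) <- c("r1", "r2")
colnames(m) <- c("a", "b", "c")
nrow(m)
ncol(m)
dimnames(m)
# cbind() adds a column, rbind() adds a row
dim(cbind(m, d = 7:8))
dim(rbind(m, r3 = 9:11))
# t() swaps rows and columns, along with their names
dim(t(m))
rownames(t(m))
```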
A vector without a `dim` attribute set is often thought of as 1-dimensional, but actually has `NULL` dimensions. You also can have matrices with a single row or single column, or arrays with a single dimension. They may print similarly, but will behave differently. The differences aren't too important, but it's useful to know they exist in case you get strange output from a function (`tapply()` is a frequent offender). As always, use `str()` to reveal the differences.
```{r}
str(1:3) # 1d vector
str(matrix(1:3, ncol = 1)) # column vector
str(matrix(1:3, nrow = 1)) # row vector
str(array(1:3, 3)) # "array" vector
```
### Exercises
1. How is `setNames()` implemented? How is `unname()` implemented?
Read the source code.
1. What does `dim()` return when applied to a 1D vector?
When might you use `NROW()` or `NCOL()`?
1. How would you describe the following three objects? What makes them
   different from `1:5`?

    ```{r}
    x1 <- array(1:5, c(1, 1, 5))
    x2 <- array(1:5, c(1, 5, 1))
    x3 <- array(1:5, c(5, 1, 1))
    ```

1. An early draft used this code to illustrate `structure()`:

    ```{r}
    structure(1:5, comment = "my attribute")
    ```

    But when you print that object you don't see the comment attribute.
    Why? Is the attribute missing, or is there something else special about
    it? (Hint: try using help.)
## S3 atomic vectors
\index{attributes!S3}
\index{S3!vectors}
One of the most important vector attributes is `class`, which underlies the S3 object system. Having a class attribute turns an object into an __S3 object__, which means it will behave differently from a regular vector when passed to a __generic__ function. Every S3 object is built on top of a base type, and often stores additional information in other attributes. You'll learn the details of the S3 object system, and how to create your own S3 classes, in Chapter \@ref(s3).
In this section, we'll discuss four important S3 vectors used in base R:
* Categorical data, where values come from a fixed set of levels recorded in
__factor__ vectors.
* Dates (with day resolution), which are recorded in __Date__ vectors.
* Date-times (with second or sub-second resolution), which are stored in
__POSIXct__ vectors.
* Durations, which are stored in __difftime__ vectors.
```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("diagrams/vectors/summary-tree-s3-1.png")
```
### Factors
\indexc{factor}
\indexc{stringsAsFactors}
A factor is a vector that can contain only predefined values. It is used to store categorical data. Factors are built on top of an integer vector with two attributes: a `class`, "factor", which makes it behave differently from regular integer vectors, and `levels`, which defines the set of allowed values.
```{r}
x <- factor(c("a", "b", "b", "a"))
x
typeof(x)
attributes(x)
```
```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("diagrams/vectors/factor.png")
```
Factors are useful when you know the set of possible values but they're not all present in a given dataset. In contrast to a character vector, when you tabulate a factor you'll get counts of all categories, even unobserved ones:
```{r}
sex_char <- c("m", "m", "m")
sex_factor <- factor(sex_char, levels = c("m", "f"))
table(sex_char)
table(sex_factor)
```
__Ordered__ factors are a minor variation on factors. In general, they behave like regular factors, but the order of the levels is meaningful (e.g. "low", "medium", "high"), a property that some modelling and visualisation functions use automatically.
```{r}
grade <- ordered(c("b", "b", "a", "c"), levels = c("c", "b", "a"))
grade
```
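
One practical consequence of meaningful level order is that comparison and sorting work on ordered factors. A short sketch, reusing the `grade` example so that it runs on its own:

```{r}
grade <- ordered(c("b", "b", "a", "c"), levels = c("c", "b", "a"))
# Comparisons follow the order of the levels, not alphabetical order
grade > "c"
max(grade)
# sort() also respects the level order
sort(grade)
```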
In base R[^tidyverse-factors] you tend to encounter factors very frequently because, in versions of R before 4.0.0, many base R functions (like `read.csv()` and `data.frame()`) automatically convert character vectors to factors by default. This is suboptimal because there's no way for those functions to know the set of all possible levels or their correct order: the levels are a property of theory or experimental design, not of the data. Instead, use the argument `stringsAsFactors = FALSE` to suppress this behaviour (it became the default in R 4.0.0), and then manually convert character vectors to factors using your knowledge of the "theoretical" data. To learn about the historical context of this behaviour, I recommend [*stringsAsFactors: An unauthorized biography*](http://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/) by Roger Peng, and [*stringsAsFactors = \<sigh\>*](http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh) by Thomas Lumley.
[^tidyverse-factors]: The tidyverse never automatically coerces characters to factors, and provides the forcats [@forcats] package specifically for working with factors.
While factors look like (and often behave like) character vectors, they are built on top of integers. So be careful when treating them like strings. Some string methods (like `gsub()` and `grepl()`) will automatically coerce factors to strings, others (like `nchar()`) will throw an error, and still others (like `c()`) will use the underlying integer values. For this reason, it's usually best to explicitly convert factors to character vectors if you need string-like behaviour.
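
A minimal sketch of why the explicit conversion matters, using a factor whose labels happen to look like numbers. Converting directly to integer returns the internal codes, not the labels:

```{r}
f <- factor(c("10", "20", "10"))
# The underlying integer codes, not the labels
as.integer(f)
# Convert to character first to recover the original values
as.integer(as.character(f))
# String functions then behave as expected
nchar(as.character(f))
```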
### Dates
\indexc{Date}
Date vectors are built on top of double vectors. They have class "Date" and no other attributes:
```{r}
today <- Sys.Date()
typeof(today)
attributes(today)
```
The value of the double (which can be seen by stripping the class) represents the number of days since 1970-01-01[^epoch]:
```{r}
date <- as.Date("1970-02-01")
unclass(date)
```
[^epoch]: This special date is known as the Unix Epoch.
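
The relationship also works in the other direction. As a sketch, a Date can be built by hand from a double and the class attribute, and date arithmetic is simply arithmetic on the underlying number of days:

```{r}
# Day zero is the epoch itself
epoch <- structure(0, class = "Date")
format(epoch)
# Arithmetic works in whole days on the underlying double
date <- as.Date("1970-02-01")
as.numeric(date)
format(date + 28)
```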
### Date-times
\index{date-times|see {\texttt{POSIXct}}}
\indexc{POSIXct}
Base R[^tidyverse-datetimes] provides two ways of storing date-time information, POSIXct, and POSIXlt. These are admittedly odd names: "POSIX" is short for Portable Operating System Interface, which is a family of cross-platform standards. "ct" stands for calendar time (the `time_t` type in C), and "lt" for local time (the `struct tm` type in C). Here we'll focus on `POSIXct`, because it's the simplest, is built on top of an atomic vector, and is most appropriate for use in data frames. POSIXct vectors are built on top of double vectors, where the value represents the number of seconds since 1970-01-01.
```{r}
now_ct <- as.POSIXct("2018-08-01 22:00", tz = "UTC")
now_ct
typeof(now_ct)
attributes(now_ct)
```
The `tzone` attribute controls only how the date-time is formatted; it does not control the instant of time represented by the vector. Note that the time is not printed if it is midnight.
```{r}
structure(now_ct, tzone = "Asia/Tokyo")
structure(now_ct, tzone = "America/New_York")
structure(now_ct, tzone = "Australia/Lord_Howe")
structure(now_ct, tzone = "Europe/Paris")
```
[^tidyverse-datetimes]: The tidyverse provides the lubridate [@lubridate] package for working with date-times. It provides a number of convenient helpers that work with the base POSIXct type.
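
To confirm that `tzone` affects only formatting, the following sketch compares the underlying doubles of two versions of the same instant. It assumes the "Asia/Tokyo" time zone is available on your system, as in the examples above:

```{r}
now_ct <- as.POSIXct("2018-08-01 22:00", tz = "UTC")
tokyo <- structure(now_ct, tzone = "Asia/Tokyo")
# The printed representations differ...
format(now_ct)
format(tokyo)
# ...but the underlying number of seconds since the epoch is the same
as.numeric(now_ct) == as.numeric(tokyo)
```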
### Durations
\index{durations|see {difftime}}
\indexc{difftime}
Durations, which represent the amount of time between two dates or date-times, are stored in difftimes. Difftimes are built on top of doubles, and have a `units` attribute that determines how the number should be interpreted:
```{r}
one_week_1 <- as.difftime(1, units = "weeks")
one_week_1
typeof(one_week_1)
attributes(one_week_1)
one_week_2 <- as.difftime(7, units = "days")
one_week_2
typeof(one_week_2)
attributes(one_week_2)
```
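
Although the two objects store different numbers, they describe the same duration. A short sketch that converts both to a common unit with the `units` argument of `as.numeric()` for difftimes:

```{r}
one_week_1 <- as.difftime(1, units = "weeks")
one_week_2 <- as.difftime(7, units = "days")
# The stored doubles differ because the units differ
unclass(one_week_1)
unclass(one_week_2)
# Converting to a common unit shows they describe the same duration
as.numeric(one_week_1, units = "days")
as.numeric(one_week_2, units = "days")
```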
### Exercises
1. What sort of object does `table()` return? What is its type? What
attributes does it have? How does the dimensionality change as you
tabulate more variables?
1. What happens to a factor when you modify its levels?

    ```{r, results = FALSE}
    f1 <- factor(letters)
    levels(f1) <- rev(levels(f1))
    ```

1. What does this code do? How do `f2` and `f3` differ from `f1`?

    ```{r, results = FALSE}
    f2 <- rev(factor(letters))
    f3 <- factor(letters, levels = rev(letters))
    ```
## Lists
\index{lists}
\index{vectors!recursive|see {lists}}
\index{vectors!generic|see {lists}}
Lists are a step up in complexity from atomic vectors: each element can be any type, not just vectors. Technically speaking, each element of a list is actually the same type because, as you saw in Section \@ref(list-references), each element is really a _reference_ to another object, which can be any type.
### Creating {#list-creating}
\indexc{list()}
You construct lists with `list()`:
```{r}
l1 <- list(
1:3,
"a",
c(TRUE, FALSE, TRUE),
c(2.3, 5.9)
)
typeof(l1)
str(l1)
```
Because the elements of a list are references, creating a list does not involve copying the components into the list. For this reason, the total size of a list might be smaller than you expect.
```{r}
lobstr::obj_size(mtcars)
l2 <- list(mtcars, mtcars, mtcars, mtcars)
lobstr::obj_size(l2)
```
Lists can contain complex objects so it's not possible to pick a single visual style that works for every list. Generally I'll draw lists like vectors, using colour to remind you of the hierarchy.
```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("diagrams/vectors/list.png")
```
Lists are sometimes called __recursive__ vectors because a list can contain other lists. This makes them fundamentally different from atomic vectors.
```{r}
l3 <- list(list(list(1)))
str(l3)
```
```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("diagrams/vectors/list-recursive.png")
```
`c()` will combine several lists into one. If given a combination of atomic vectors and lists, `c()` will coerce the vectors to lists before combining them. Compare the results of `list()` and `c()`:
```{r}
l4 <- list(list(1, 2), c(3, 4))
l5 <- c(list(1, 2), c(3, 4))
str(l4)
str(l5)
```
```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("diagrams/vectors/list-c.png")
```
### Testing and coercion {#list-types}
The `typeof()` a list is `list`. You can test for a list with `is.list()`, and coerce to a list with `as.list()`.
```{r}
list(1:3)
as.list(1:3)
```
You can turn a list into an atomic vector with `unlist()`. The rules for the resulting type are complex, not well documented, and not always equivalent to what you'd get with `c()`.
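
For simple inputs the behaviour is predictable. This sketch covers only the straightforward cases (plain atomic elements and nested named lists), not the complicated ones alluded to above:

```{r}
# Elements are coerced to the most flexible type present
typeof(unlist(list(1:2, 3.5)))
typeof(unlist(list(1:2, "a")))
# Nested lists are flattened, and names are combined
unlist(list(a = 1, b = list(c = 2, d = 3)))
```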
### Matrices and arrays {#list-array}
\index{lists!list-arrays}
\index{arrays!list-arrays}
With atomic vectors, the dimension attribute is commonly used to create matrices. With lists, the dimension attribute can be used to create list-matrices or list-arrays:
```{r}
l <- list(1:3, "a", TRUE, 1.0)
dim(l) <- c(2, 2)
l
l[[1, 1]]
```
These data structures are relatively esoteric but they can be useful if you want to arrange objects in a grid-like structure. For example, if you're running models on a spatio-temporal grid, it might be more intuitive to store the models in a 3D array that matches the grid structure.
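
As a rough sketch of the grid idea, the following fills a 2 x 3 list-matrix with one small list per cell. The name `results` and the fields `row`, `col`, and `value` are hypothetical stand-ins for whatever objects (such as fitted models) you would actually store:

```{r}
# A hypothetical 2 x 3 grid of results, one list per cell
results <- vector("list", 6)
dim(results) <- c(2, 3)
for (i in 1:2) {
  for (j in 1:3) {
    # Stand-in for a fitted model or other complex object
    results[[i, j]] <- list(row = i, col = j, value = i * j)
  }
}
dim(results)
# Retrieve the object stored at row 2, column 3
results[[2, 3]]$value
```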
### Exercises
1. List all the ways that a list differs from an atomic vector.
1. Why do you need to use `unlist()` to convert a list to an
atomic vector? Why doesn't `as.vector()` work?
1. Compare and contrast `c()` and `unlist()` when combining a
date and date-time into a single vector.
## Data frames and tibbles {#tibble}
\index{data frames}
\index{tibbles|see {data frames}}
\indexc{row.names}
The two most important S3 vectors built on top of lists are data frames and tibbles.
```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("diagrams/vectors/summary-tree-s3-2.png")
```
If you do data analysis in R, you're going to be using data frames. A data frame is a named list of vectors with attributes for (column) `names`, `row.names`[^rownames], and its class, "data.frame":
```{r}
df1 <- data.frame(x = 1:3, y = letters[1:3])
typeof(df1)
attributes(df1)
```
[^rownames]: Row names are one of the most surprisingly complex data structures in R. They've also been a persistent source of performance issues over the years. The most straightforward implementation is a character or integer vector, with one element for each row. But there's also a compact representation for "automatic" row names (consecutive integers), created by `.set_row_names()`. R 3.5 has a special way of deferring integer to character conversion that is specifically designed to speed up `lm()`; see <https://svn.r-project.org/R/branches/ALTREP/ALTREP.html#deferred_string_conversions> for details.
In contrast to a regular list, a data frame has an additional constraint: the length of each of its vectors must be the same. This gives data frames their rectangular structure and explains why they share the properties of both matrices and lists:
* A data frame has `rownames()`[^row.names] and `colnames()`. The `names()`
of a data frame are the column names.
* A data frame has `nrow()` rows and `ncol()` columns. The `length()` of a
data frame gives the number of columns.
[^row.names]: Technically, you are encouraged to use `row.names()`, not `rownames()` with data frames, but this distinction is rarely important.
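
The dual nature described above can be sketched with a small data frame: the list-like functions refer to columns, while the matrix-like functions report both dimensions.

```{r}
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
# List-like behaviour: names() and length() refer to columns
names(df)
length(df)
# Matrix-like behaviour: dimensions and row/column names
dim(df)
nrow(df)
ncol(df)
colnames(df)
rownames(df)
```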
Data frames are one of the biggest and most important ideas in R, and one of the things that makes R different from other programming languages. However, in the over 20 years since their creation, the ways that people use R have changed, and some of the design decisions that made sense at the time data frames were created now cause frustration.
This frustration led to the creation of the tibble [@tibble], a modern reimagining of the data frame. Tibbles are designed to be (as much as possible) drop-in replacements for data frames that fix those frustrations. A concise, and fun, way to summarise the main differences is that tibbles are lazy and surly: they do less and complain more. You'll see what that means as you work through this section.
Tibbles are provided by the tibble package and share the same structure as data frames. The only difference is that the class vector is longer, and includes `tbl_df`. This allows tibbles to behave differently in the key ways which we'll discuss below.
```{r}
library(tibble)
df2 <- tibble(x = 1:3, y = letters[1:3])
typeof(df2)
attributes(df2)
```
### Creating {#df-create}
\indexc{stringsAsFactors}
\index{data frames!data.frame@\texttt{data.frame()}}
You create a data frame by supplying name-vector pairs to `data.frame()`:
```{r}
df <- data.frame(
x = 1:3,
y = c("a", "b", "c")
)
str(df)
```
Beware of the default conversion of strings to factors in versions of R before 4.0.0. There, use `stringsAsFactors = FALSE` to suppress this and keep character vectors as character vectors (from R 4.0.0 onwards this is the default):
```{r}
df1 <- data.frame(
x = 1:3,
y = c("a", "b", "c"),
stringsAsFactors = FALSE
)
str(df1)
```
Creating a tibble is similar to creating a data frame. The difference between the two is that tibbles never coerce their input (this is one feature that makes them lazy):
```{r}
df2 <- tibble(
x = 1:3,
y = c("a", "b", "c")
)
str(df2)
```
Additionally, while data frames automatically transform non-syntactic names (unless `check.names = FALSE`), tibbles do not (although they do print non-syntactic names surrounded by `` ` ``).
```{r}
names(data.frame(`1` = 1))
names(tibble(`1` = 1))
```
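For comparison, here is a small sketch of how the transformation can be switched off in `data.frame()`. It relies on `make.names()`, the base R function that produces syntactic names:

```{r}
# make.names() shows the kind of name data.frame() generates by default
make.names("1")

# With check.names = FALSE, data.frame() leaves the name untouched
names(data.frame(`1` = 1, check.names = FALSE))
```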
While every element of a data frame (or tibble) must have the same length, both `data.frame()` and `tibble()` will recycle shorter inputs. However, while data frames automatically recycle any column whose length evenly divides the length of the longest column, tibbles will only recycle vectors of length one.
```{r, error = TRUE}
data.frame(x = 1:4, y = 1:2)
data.frame(x = 1:4, y = 1:3)
tibble(x = 1:4, y = 1)
tibble(x = 1:4, y = 1:2)
```
There is one final difference: `tibble()` allows you to refer to variables created during construction:
```{r}
tibble(
x = 1:3,
y = x * 2
)
```
(Inputs are evaluated left-to-right.)
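By contrast, `data.frame()` looks up each input in the calling environment, so a column defined earlier in the same call is not visible. The sketch below assumes no object called `len_cm` exists in your session; `len_cm` and `len_mm` are hypothetical column names:

```{r}
# Capture the error message instead of stopping
res <- tryCatch(
  data.frame(len_cm = 1:3, len_mm = len_cm * 10),
  error = function(e) conditionMessage(e)
)
res

# One base R workaround: add the derived column in a second step
two_step <- data.frame(len_cm = 1:3)
two_step$len_mm <- two_step$len_cm * 10
two_step
```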
When drawing data frames and tibbles, rather than focussing on the implementation details, i.e. the attributes:
```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("diagrams/vectors/data-frame-1.png")
```
I'll draw them the same way as a named list, but arrange them to emphasise their columnar structure.
```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("diagrams/vectors/data-frame-2.png")
```
### Row names {#rownames}
\indexc{row.names}
Data frames allow you to label each row with a "name", a character vector containing only unique values:
```{r}
df3 <- data.frame(
age = c(35, 27, 18),
hair = c("blond", "brown", "black"),
row.names = c("Bob", "Susan", "Sam")
)
df3
```
You can get and set row names with `rownames()`, and you can use them to subset rows:
```{r}
rownames(df3)
df3["Bob", ]
```
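Setting works through the replacement form `rownames(x) <- value`. A minimal sketch, using a hypothetical data frame `pets` so that no other object is modified:

```{r}
pets <- data.frame(
  age = c(4, 2),
  species = c("dog", "cat"),
  row.names = c("Rex", "Tom")
)

# Replace the row names, then subset with the new labels
rownames(pets) <- c("Rexy", "Tommy")
rownames(pets)
pets["Tommy", ]
```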
Row names arise naturally if you think of data frames as 2D structures like matrices: columns (variables) have names so rows (observations) should too. Most matrices are numeric, so having a place to store character labels is important. But this analogy to matrices is misleading because matrices possess an important property that data frames do not: they are transposable. In matrices the rows and columns are interchangeable, and transposing a matrix gives you another matrix (transposing again gives you the original matrix). With data frames, however, the rows and columns are not interchangeable: the transpose of a data frame is not a data frame.
There are three reasons why row names are undesirable:
* Metadata is data, so storing it in a different way to the rest of the
data is fundamentally a bad idea. It also means that you need to learn
a new set of tools to work with row names; you can't use what you already
know about manipulating columns.
* Row names are a poor abstraction for labelling rows because they only work
when a row can be identified by a single string. This fails in many cases,
for example when you want to identify a row by a non-character vector
(e.g. a time point), or with multiple vectors (e.g. position, encoded by
latitude and longitude).
* Row names must be unique, so any duplication of rows (e.g. from
bootstrapping) will create new row names. If you want to match rows from
before and after the transformation, you'll need to perform complicated
string surgery.
```{r}
df3[c(1, 1, 1), ]
```
For these reasons, tibbles do not support row names. Instead, the tibble package provides tools to easily convert row names into a regular column with either `rownames_to_column()`, or the `rownames` argument in `as_tibble()`:
```{r}
as_tibble(df3, rownames = "name")
```
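The other route, `rownames_to_column()`, might look like the sketch below. The data frame `people` and the column name `"name"` are chosen only for illustration:

```{r}
people <- data.frame(
  age = c(35, 27, 18),
  row.names = c("Bob", "Susan", "Sam")
)

# Move the row names into a regular column called "name"
people_named <- tibble::rownames_to_column(people, var = "name")
people_named
```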
### Printing
One of the most obvious differences between tibbles and data frames is how they print. I assume that you're already familiar with how data frames are printed, so here I'll highlight some of the biggest differences using an example dataset included in the dplyr package:
```{r}
dplyr::starwars
```
* Tibbles only show the first 10 rows and all the columns that will fit on
screen. Additional columns are shown at the bottom.
* Each column is labelled with its type, abbreviated to three or four letters.
* Wide columns are truncated to avoid having a single long string occupy an
entire row. (This is still a work in progress: it's a tricky tradeoff between
showing as many columns as possible and showing columns in their entirety.)
* When used in console environments that support it, colour is used judiciously
to highlight important information, and de-emphasise supplemental details.
### Subsetting {#safe-subsetting}
As you will learn in Chapter \@ref(subsetting), you can subset a data frame or a tibble like a 1D structure (where it behaves like a list), or a 2D structure (where it behaves like a matrix).
In my opinion, data frames have two undesirable subsetting behaviours:
* When you subset columns with `df[, vars]`, you will get a vector if `vars`
selects one variable, otherwise you'll get a data frame. This is a frequent
source of bugs when using `[` in a function, unless you always remember to
use `df[, vars, drop = FALSE]`.
* When you attempt to extract a single column with `df$x` and there is no
column `x`, a data frame will instead select any variable that starts with
`x`. If no variable starts with `x`, `df$x` will return `NULL`. This makes
it easy to select the wrong variable or to select a variable that doesn't
exist.
Tibbles tweak these behaviours so that a `[` always returns a tibble, and a `$` doesn't do partial matching and warns if it can't find a variable (this is what makes tibbles surly).
```{r opts, include = FALSE}
opts <- options(warnPartialMatchDollar = FALSE)
```
```{r, dependson="opts"}
df1 <- data.frame(xyz = "a")
df2 <- tibble(xyz = "a")
str(df1$x)
str(df2$x)
```
```{r, include = FALSE}
if (!is.null(opts$warnPartialMatchDollar))
options(opts)
```
A tibble's insistence on returning a data frame from `[` can cause problems with legacy code, which often uses `df[, "col"]` to extract a single column. If you want a single column, I recommend using `df[["col"]]`. This clearly communicates your intent, and works with both data frames and tibbles.
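The following sketch contrasts the three ways of selecting a single column. `sub_df` and `sub_tb` are hypothetical objects created just for this comparison:

```{r}
sub_df <- data.frame(a = 1:3, b = 4:6)
sub_tb <- tibble::tibble(a = 1:3, b = 4:6)

# `[` with one column: a data frame drops to a vector unless drop = FALSE
str(sub_df[, "a"])
str(sub_df[, "a", drop = FALSE])

# The same call on a tibble returns a tibble
str(sub_tb[, "a"])

# `[[` extracts the column itself from either
identical(sub_df[["a"]], sub_tb[["a"]])
```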
### Testing and coercing {#df-test-coerce}
To check if an object is a data frame or tibble, use `is.data.frame()`:
```{r}
is.data.frame(df1)
is.data.frame(df2)
```
Typically, it should not matter if you have a tibble or data frame, but if you need to be certain, use `is_tibble()`:
```{r}
is_tibble(df1)
is_tibble(df2)
```
You can coerce an object to a data frame with `as.data.frame()` or to a tibble with `as_tibble()`.
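A short round trip illustrates both directions. The names `co_df`, `co_tb`, and `back` are placeholders for this sketch:

```{r}
co_df <- data.frame(x = 1:2, y = c("a", "b"))

# Data frame to tibble
co_tb <- tibble::as_tibble(co_df)
tibble::is_tibble(co_tb)

# Tibble back to a plain data frame
back <- as.data.frame(co_tb)
tibble::is_tibble(back)
is.data.frame(back)
```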
### List columns
\index{data frames!list-columns}
\indexc{I()}
Since a data frame is a list of vectors, it is possible for a data frame to have a column that is a list. This is very useful because a list can contain any other object: this means you can put any object in a data frame. This allows you to keep related objects together in a row, no matter how complex the individual objects are. You can see an application of this in the "Many Models" chapter of "R for Data Science", <http://r4ds.had.co.nz/many-models.html>.
List-columns are allowed in data frames but you have to do a little extra work by either adding the list-column after creation or wrapping the list in `I()`[^identity].
[^identity]: `I()` is short for identity and is often used to indicate that an input should be left as is, and not automatically transformed.
```{r}
df <- data.frame(x = 1:3)
df$y <- list(1:2, 1:3, 1:4)
data.frame(
x = 1:3,
y = I(list(1:2, 1:3, 1:4))
)
```
```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("diagrams/vectors/data-frame-list.png")
```
List columns are easier to use with tibbles because they can be directly included inside `tibble()` and they will be printed tidily:
```{r}
tibble(
x = 1:3,
y = list(1:2, 1:3, 1:4)
)
```
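Once a list-column exists, each row's element can be reached with ordinary list tools. This is only a sketch of one way to work with it; `lc` is a hypothetical name:

```{r}
lc <- tibble::tibble(
  x = 1:3,
  y = list(1:2, 1:3, 1:4)
)

# The column is a list, so `[[` pulls out the element for one row
lc$y[[2]]

# Apply a function to every element to get one value per row
lengths(lc$y)
vapply(lc$y, sum, integer(1))
```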
### Matrix and data frame columns
\index{data frames!matrix-columns}
As long as the number of rows matches the data frame, it's also possible to have a matrix or array as a column of a data frame. (This requires a slight extension to our definition of a data frame: it's not the `length()` of each column that must be equal, but the `NROW()`.) Like with list-columns, you must either add it after creation, or wrap it in `I()`.
```{r}
dfm <- data.frame(
x = 1:3 * 10
)
dfm$y <- matrix(1:9, nrow = 3)
dfm$z <- data.frame(a = 3:1, b = letters[1:3], stringsAsFactors = FALSE)
str(dfm)
```
```{r, echo = FALSE, out.width = NULL}
knitr::include_graphics("diagrams/vectors/data-frame-matrix.png")
```
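To see why `NROW()` rather than `length()` is the relevant measure, compare the two on a vector, a matrix, and a data frame. This is a small sketch; `nrow_m` is a name made up for the example:

```{r}
nrow_m <- matrix(1:9, nrow = 3)

# length() counts every cell of the matrix; NROW() counts its rows
length(nrow_m)
NROW(nrow_m)

# For a plain vector, NROW() is the same as length()
NROW(1:3)

# For a data frame, NROW() is the number of rows, not the number of columns
NROW(data.frame(a = 3:1, b = letters[1:3]))
```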
Matrix and data frame columns require a little caution. Many functions that work with data frames assume that all columns are vectors. Also, the printed display can be confusing.
```{r}
dfm[1, ]
```
### Exercises
1. Can you have a data frame with zero rows? What about zero columns?
1. What happens if you attempt to set rownames that are not unique?
1. If `df` is a data frame, what can you say about `t(df)`, and `t(t(df))`?
Perform some experiments, making sure to try different column types.
1. What does `as.matrix()` do when applied to a data frame with
columns of different types? How does it differ from `data.matrix()`?
## `NULL`
\indexc{NULL}
To finish up this chapter, I want to talk about one final important data structure that's closely related to vectors: `NULL`. `NULL` is special because it has a unique type, is always length zero, and can't have any attributes:
```{r, error = TRUE}
typeof(NULL)
length(NULL)
x <- NULL
attr(x, "y") <- 1
```
You can test for `NULL`s with `is.null()`:
```{r}
is.null(NULL)
```
There are two common uses of `NULL`:
* To represent an empty vector (a vector of length zero) of arbitrary type.
For example, if you use `c()` but don't include any arguments, you get
`NULL`, and concatenating `NULL` to a vector will leave it unchanged:
```{r}
c()
```
* To represent an absent vector. For example, `NULL` is often used as a
default function argument, when the argument is optional but the default
value requires some computation (see Section \@ref(missing-arguments) for
more on this). Contrast this with `NA` which is used to indicate that
an _element_ of a vector is absent.
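Both uses can be sketched briefly. The function `weighted_total()` and its argument `w` are hypothetical, invented only to illustrate the optional-argument pattern:

```{r}
# Concatenating NULL leaves a vector unchanged
identical(c(1:3, NULL), 1:3)

# NULL as the default for an optional argument
weighted_total <- function(x, w = NULL) {
  if (is.null(w)) {
    # The default needs some computation: equal weights, one per element
    w <- rep(1, length(x))
  }
  sum(x * w)
}

weighted_total(c(1, 2, 3))
weighted_total(c(1, 2, 3), w = c(0, 0, 1))
```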
If you're familiar with SQL, you'll know about relational `NULL` and might expect it to be the same as R's. However, the database `NULL` is actually equivalent to R's `NA`.
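The difference shows up in how the two behave in a comparison, as in this minimal sketch:

```{r}
# NA is a missing element: it has length one and propagates through comparisons
length(NA)
NA > 1

# NULL is an absent vector: it has length zero, so the result is empty
length(NULL)
NULL > 1
```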
## Answers {#data-structure-answers}
1. The four common types of atomic vector are logical, integer, double
and character. The two rarer types are complex and raw.
1. Attributes allow you to associate arbitrary additional metadata to
any object. You can get and set individual attributes with `attr(x, "y")`
and `attr(x, "y") <- value`; or you can get and set all attributes at once
with `attributes()`.
1. The elements of a list can be any type (even a list); the elements of
an atomic vector are all of the same type. Similarly, every element of
a matrix must be the same type; in a data frame, different columns can have
different types.
1. You can make a "list-array" by assigning dimensions to a list. You can
make a matrix a column of a data frame with `df$x <- matrix()`, or by
using `I()` when creating a new data frame `data.frame(x = I(matrix()))`.
1. Tibbles have an enhanced print method, never coerce strings to
factors, and provide stricter subsetting methods.