Making the CompressedList more widely usable #11

Open
LTLA opened this issue Apr 2, 2019 · 4 comments

@LTLA
Contributor

LTLA commented Apr 2, 2019

I've been playing around with the CompressedList subclasses for representing some complex data types and I've really come to like it. I've been thinking of ways to make it more generally usable by both end-users and other developers, and I've got a few wish-list elements:

Access to unlistData and partitioning. End-users would then be able to execute arbitrary unary operations on the underlying data while preserving the partitioning, like:

library(IRanges)
A <- DataFrame(X=LETTERS, Y=runif(26))
comp.list <- split(A,A$X)

# Attempt fails, for obvious reasons.
comp.list$Y <- log(comp.list$Y)

# Assuming we had an unlistData() method:
unlistData(comp.list)$Y <- log(unlistData(comp.list)$Y)

unlistData<- could even be unlist<-, if one were willing to introduce that concept. I don't mind if partitioning is getter-only; this would still be very useful for downstream functions that need to be list-aware yet don't want to create an intermediate list for efficiency purposes.

Non-virtual CompressedList class. I don't understand the motivation for making CompressedList virtual. From a representation perspective, a general concrete class would be useful if we could store any vector-like entity in unlistData. In fact, I ran into the case where I wanted to store a CompressedCharacterList as unlistData, effectively making a CompressedCompressedCharacterListList! I don't expect to be able to call many methods on this thing - other than the proposed unlistData and partitioning, and maybe unlist - I just want to use it for storage without needing to write an explicit subclass. A general CompressedList class would serve this purpose, and is better than the alternative of falling back to a SimpleList (which takes a noticeable time to generate).

A more careful unlist. If we do allow a general CompressedList class, the unlist method should probably take heed of recursive=TRUE and apply unlist on the unlistData slot.

I'm happy to chip in with a PR if these sound like good ideas.

@lawremi
Collaborator

lawremi commented Apr 6, 2019

An unlist<-() has been suggested in the past (e.g., by @mtmorgan). It's probably worth having, but up until now we have managed by just adding methods for functions like log() and using relist() directly. We would welcome a pull request for unlist<-(), but it would also be nice to have log() and related methods for NumericList.

Making CompressedList non-virtual (and requiring Vector for @unlistData) is an interesting idea. A separate pull request is welcome, if only to spur discussion. Agree that it should consider recursive=.

@hpages
Contributor

hpages commented Apr 6, 2019

Note that relist() already does what the proposed unlist<- would do on a CompressedList. So IIUC the unlist<- proposal would basically be to replace the well-established idiom:

relist(as.character(unlist(x)), x)

with

unlist(x) <- as.character(unlist(x))

Personally I prefer to stick to relist() for several reasons:

  • It's a base R verb that everybody is already familiar with.
  • The relist(as.character(unlist(x)), x) idiom is more readable (but that's just my opinion).
  • It's also a powerful idiom that can be used on list-like objects in general (including ordinary lists), not just on CompressedList objects (unless the proposal is to generalize unlist<- to all list-like objects, but that's what relist() does already).
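
As a minimal sketch of the point about ordinary lists, the idiom can be tried in base R alone (relist() lives in the utils package, which is attached by default). The list `x` below is made-up illustration data, not something taken from this thread:

```r
# Illustrative data only: an ordinary (non-Compressed) list.
x <- list(a = 1:3, b = 4:5)

# Transform the flattened values, then restore the original partitioning.
y <- relist(as.character(unlist(x)), x)

# Inspect the result: same shape as x, but with character elements.
str(y)
```

The skeleton `x` only supplies the partitioning, so the flattened vector passed to relist() needs to have the same total length as unlist(x).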

@lawremi
Collaborator

lawremi commented Apr 6, 2019

I do like the symmetry, simplicity and safety of the unlist<-() syntax. relist() is in base, but it's fairly obscure. If we move forward, we should definitely make unlist<-() work on all types of lists.
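
To make the idea concrete, here is a rough sketch of what a generic unlist<-() could look like if it were simply a thin wrapper around relist(). This replacement function is hypothetical: it is not defined in base R, S4Vectors or IRanges, and the length check is an assumption about what "safety" might mean here. It is shown on an ordinary list only:

```r
# Hypothetical replacement function; not part of base R, S4Vectors or IRanges.
`unlist<-` <- function(x, value) {
    # Assumed safety check: the replacement must not change the element count.
    stopifnot(length(value) == length(unlist(x)))
    # Reuse the existing partitioning of x as the skeleton.
    utils::relist(value, x)
}

# Made-up illustration data.
x <- list(a = c("x", "y"), b = "z")

# R rewrites this as: x <- `unlist<-`(x, value = toupper(unlist(x)))
unlist(x) <- toupper(unlist(x))
```

A real implementation would presumably be an S4 generic with methods for List subclasses rather than a plain function, but the calling syntax would be the same.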

@LTLA
Contributor Author

LTLA commented Apr 6, 2019

Indeed, I didn't even know about relist until a few days ago when I was poking around inside IRanges.

My motivation for unlist<- is mostly driven by use with CompressedSplitDataFrameLists, where it provides a simple mechanism for switching between List-mode and DataFrame-mode.

library(IRanges)
X <- DataFrame(statistic=runif(100), more_stats=rnorm(100))
Y <- split(X, sample(LETTERS, 100, replace=TRUE))

# List-style getter/setter:
Y$statistic <- Y$statistic * 2

# Hypothetical DataFrame-style getter/setter:
unlist(Y)$statistic <- log2(unlist(Y)$statistic)

The relist syntax would require an explicit intermediate DataFrame. I guess you could argue that this is clearer, but it's inconvenient to have to split it across three lines (especially in interactive sessions).

Z <- unlist(Y)
Z$statistic <- log2(Z$statistic)
Y <- relist(Z, Y)

A recursive unlist<- also provides an approach for reaching deep into nested CompressedList objects, if it were possible to store a CompressedList as the unlistData of another CompressedList:

basic <- sample(LETTERS, 100, replace=TRUE)
nest1 <- CharacterList(split(basic, sample(10, length(basic), replace=TRUE)))

# Create a list of CompressedList instances (if CompressedList() existed)
nest2 <- CompressedList(split(nest1, sample(3, length(nest1), replace=TRUE)))

# recursive=TRUE as default
unlist(nest2) <- paste0("WHEE", unlist(nest2))
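
For comparison, base R's relist() already recurses through nested ordinary lists, which gives a rough feel for the recursive behaviour described above. This is only an analogy on plain lists with made-up data; the CompressedList() constructor in the preceding example remains hypothetical:

```r
# Made-up nested ordinary list standing in for a list of lists.
nested <- list(list(c("A", "B"), "C"), list(c("D", "E")))

# unlist() recurses by default, yielding a single flat character vector.
flat <- unlist(nested)

# relist() walks the nested skeleton and restores both levels of partitioning.
out <- relist(paste0("WHEE", flat), nested)

str(out)
```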
