SE Radio 450: Hadley Wickham on R and Tidyverse

Hadley Wickham, Chief Scientist of RStudio and creator of the Tidyverse, discusses how R and its collection of data science packages, the Tidyverse, are used and created. Host Felienne spoke with Wickham about the design philosophy of the Tidyverse and how it supports the clean and reproducible analysis of data. They discuss how different fields use data science and how the Tidyverse’s language design enables different ways of working with data. They also discuss how to test and evaluate new features within a programming language or package, and how to involve the user base in decisions on languages or packages.

Transcript

Transcript brought to you by IEEE Software (automatically generated)
(This transcript was automatically generated. To suggest improvements in the text, please contact [email protected].)

Felienne 00:00:16 Hello everyone. This is Felienne for Software Engineering Radio. Today with me on the show is Dr. Hadley Wickham. He’s the Chief Scientist at RStudio, the leading IDE for R, and he’s also an adjunct professor of statistics at the University of Auckland, Stanford University, and Rice University, and the creator of the topic of today’s show, the Tidyverse, which supports a tidy data approach to data import, analysis, and modeling in R. Welcome to the show, Hadley.

Hadley Wickham Thanks for having me.

Felienne So first up, what is R? Most of our listeners are programmers, but not statisticians. So they might have heard of R as a programming language, but what is R, and what makes it unique?

Hadley Wickham 00:00:56 R is a programming language that was designed specifically for the needs of statistics and data analysis. It’s been around for a while now. It’s kind of the successor of a programming language that was invented at AT&T in the seventies, I think, the seventies or the eighties. And it was originally designed to string together a bunch of C and Fortran scripts on the command line. But since then, it’s really grown to become this environment for very fluid data analysis and exploration.

Felienne 00:01:30 So you say that R is a programming language that has specific features for statistics. Can you give an example of a feature that someone doing stats really needs in a programming language?

Hadley Wickham 00:01:41 So one thing that’s really basic and fundamental is this idea of missing values. So, you know, often you collect data and something goes wrong and you don’t know what a specific value is, and you need some way to record that. And that’s the idea of a missing value. It’s kind of like a NULL in SQL, if you’ve used that. Very, very few programming languages have that built in at a fundamental level. Most of them, like Python and Julia, kind of have it added at a higher level. The other thing, just in general, is that R was really designed from day one to be this interactive development environment. So even more than a scripting language like Python, the fundamental way you use R is that you type a line or a couple of lines of code, then you look at the output, then you think a little bit and type some more commands. It’s very, very interactive.
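
To make the missing-value idea concrete, here is a minimal sketch in Python rather than R. It is not from the interview: the function name `mean_with_missing` and the `na_rm` flag are assumptions for illustration, loosely modeled on how R’s `NA` makes a result unknown unless missing values are explicitly dropped. Python’s `None` stands in for a missing value.

```python
from typing import Optional, Sequence


def mean_with_missing(values: Sequence[Optional[float]],
                      na_rm: bool = False) -> Optional[float]:
    """Mean that treats None as a missing value (hypothetical helper)."""
    # If any value is missing and we were not asked to drop it,
    # the honest answer is "unknown", so the result is missing too.
    if not na_rm and any(v is None for v in values):
        return None
    present = [v for v in values if v is not None]
    if not present:
        return None  # nothing left to average
    return sum(present) / len(present)


ages = [30, None, 40]  # the second measurement was never recorded

print(mean_with_missing(ages))              # None: the mean is unknown
print(mean_with_missing(ages, na_rm=True))  # 35.0: mean of the known values
```

The design choice worth noticing is that the missing value propagates by default, so an analyst has to decide explicitly to ignore it.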

Felienne 00:02:37 So that’s a little bit like what you might know from other programming languages as a REPL, or is it something different? So do we know how popular R is? Because I hear this a lot: R is a very popular language for data analysis. Do we have a sense of how many users there are?

Hadley Wickham 00:02:55 I mean, it’s really, really hard to tell. I think multiple millions, certainly. And then it’s kind of a matter of picking the programming language ranking that gives your favorite programming language the highest ranking. I think it’s fair to say now, looking across a bunch of different rankings, that R is one of the top 20 most popular programming languages. Some of them show it in the top 10, but I think it’s up there. It’s pretty popular, but it’s mainly used by people doing data analysis. So it’s a little bit of a niche language. I mean, it is a general purpose programming language in the sense that you can do anything with it, but it is primarily used by people doing data science and statistics.

Felienne 00:03:44 Do you also think that that’s the way it should be, that people doing data analysis have their own programming languages, even though they might also be usable for general purposes? Is that a good idea, or would it be nicer if everyone used the same language?

Hadley Wickham 00:03:59 Yeah, I mean, I think it would be nice if there was one programming language, right? That would be wonderful. You wouldn’t have to have the same things developed multiple times in different programming languages. But of course the problem is that one programming language can’t possibly solve every problem completely in the way that you want it. If you have just one programming language, you’ve got to make a bunch of compromises. And I think you would end up with a system where no one is that happy with it. So there are certainly downsides to having this kind of niche language specifically for a pretty broad domain, but I think the advantage is that you get something that’s really tightly tailored to the types of problems that you care about as a data scientist.

Felienne 00:04:45 Yeah. That makes sense. And R has been around for quite a long time. Do you think R is moving further away from being a general purpose language, or is it adopting features to look more like a general purpose language?

Hadley Wickham 00:04:58 That’s an interesting question. I think it’s not so much the language that’s changing as the community and what the community is developing. When I started using R during my PhD, which is like 15 years ago or something, R was almost exclusively used by people with PhDs in statistics, and it was very, very tightly connected to doing data analysis. Not much has changed about the language since then, but the community has really exploded in the last 15 years. The kind of use cases, the things that people are using it for, and what they need to do their jobs have gradually expanded. And that kind of means things like connecting to new types of databases and connecting to all sorts of web APIs, where people get data from these days. And that just kind of means that its strength as a programming language has expanded into areas where it wasn’t before.

Felienne 00:05:58 So then I want to zoom in a little bit on one specific part of R, and that’s the Tidyverse that you created and designed. Why did you make the Tidyverse? What problem does it solve, on top of the problems that R already solves?

Hadley Wickham 00:06:13 I mean, it solves a very similar set of problems. I think the thing that is kind of interesting and weird about the Tidyverse is that, in some sense, it actually restrains what you can do. It kind of constrains you: it gives you a smaller set of options, powered by some kind of underlying theory, designed around the idea of composition. And in some sense, I think the Tidyverse is a search for the atoms of data science, and a simple set of techniques for combining those atoms up into molecules. So you can solve any problem, hopefully, using this pretty small set of fundamental ideas and fundamental techniques.

Felienne 00:06:59 Did you make the Tidyverse for your own use, or was it from the start already meant as something to serve the company?

Hadley Wickham 00:07:05 So it really got started well before it was called the Tidyverse, when I was doing my PhD. As part of my PhD, my assistantship was a consulting assistantship, so I helped PhD students do the data analysis for their PhDs; these were not statistics PhD students. And that was really a transformative experience for me, because I was doing my PhD in statistics at the time, taking all of these classes about really complicated mathematical theory and sophisticated models. But when it came time to actually help people through the data analysis, the first problem you encountered was: how the heck do you get the data out of whatever form they’ve stored it in? You know, people collect data and record data in a way that makes it as easy as possible to record the data, not to analyze the data. So the first problem you always had in every data analysis was: how the heck do you get it from the form they have it in, into the form that you need? And that’s really what started my journey into the ideas of tidy data and tidying data and reshaping data. Just, how could I do my job with these datasets that I’d never seen before, stored in ways that frankly at the time seemed absolutely crazy?

Felienne 00:08:23 What would you say are the challenges in creating a programming language for statisticians? I think one of the challenges you just mentioned is that data comes in any format, so we have to make it fit. That’s probably also where the feature of null values comes in handy, because sometimes you have data in a messy format. But what other challenges are there if you’re designing a programming system that’s made for what we programmers would call non-programmers, people that aren’t employed as professional programmers?

Hadley Wickham 00:08:53 Yeah. I think it’s really interesting to contemplate that. And that’s something we spend a lot of time thinking about in the Tidyverse, trying to balance: how do we serve experts, people who have decided they want to master R as a programming language and get really good at it, versus the casual users of R who just want to write the smallest possible amount of R code to get their job done? And so one of the things I really believe in is that, if you’re not going to become a programmer, if you’re not going to devote hours and hours to learning a programming language, it’s really important to figure out what’s the smallest set of big ideas you need to get into your head. In most programming languages,

Hadley Wickham 00:09:39 those are ideas around how functions work and how objects work. But in R, I think one of the big ideas is: how does data work? And this is the idea that we’ve worked on a lot, tidy data: how should you organize your data so that it’s going to be easy to analyze after the fact? I think now I can kind of explain it in a sentence. It took like 10 years to get there. But the basic idea is that you want every column in your rectangular data frame to be a variable. If you look more at the theory of tidy data, you can see there are some pretty close parallels to Codd’s third normal form. But just this idea of trying to figure out what the key idea is and how we can express it succinctly, in a way that most people can understand with a relatively small amount of training, that’s something we spend a lot of time thinking about.

Felienne 00:10:31 I think we might want to go one step back there, because you did mention a few things that not everyone in the audience might be familiar with. For example, you said a rectangular data frame has columns within it. Could you elaborate a little bit more on what that means?

Hadley Wickham 00:10:45 The main type of data that people work with in R is rectangular data. So you’ve got rows and columns, and in most cases, I think, to make data analysis easy, you want each column to be a variable, something that you’ve measured, and each row should be an observation, a collection of related measurements. It turns out to be really difficult to precisely describe what a variable is and what an observation is. But generally, the everyday knowledge people have of variables and observations tends to be pretty good once they’ve got that framework to organize their data in that rectangle.

Felienne 00:11:24 Yeah. So can you give an example of an observation and an example of a variable? Because I think variable means something slightly different here than it means for professional programmers in a program.

Hadley Wickham 00:11:34 Yeah, this is, I think, one of the challenges of teaching, and particularly of interdisciplinary work: all of these words that mean totally different things to different people. And before I talk about what a data scientist thinks of as a variable: statisticians have this idea of a random variable, which is even worse, because a random variable is actually a function. It’s neither random nor a variable, which seems like a really confusing name, and is again totally different to how you think about a variable in most programming languages. So a variable, in the data science sense, is something that you measure, like weight or height or age or someone’s name, or a unique identifier. These are like the columns you’d have in a database: collections of measurements, basically, that are related by these observations. So an observation could be a person. It could be a person at a specific time, or a person at a specific time given a specific drug to try out some new therapy.
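
The row-and-column framing above can be sketched in a few lines of Python. This is only an illustration and not how R stores a data frame: the sample people, the field names, and the helpers `column` and `is_rectangular` are all made up. Each dictionary is one observation (a person), and each key is one variable in the data science sense.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]

# Each dict is one observation; each key is one variable (a column).
observations: List[Row] = [
    {"name": "Ana", "age": 34, "weight_kg": 62.0, "height_m": 1.70},
    {"name": "Ben", "age": 51, "weight_kg": 80.0, "height_m": 1.82},
    {"name": "Chi", "age": 27, "weight_kg": 55.5, "height_m": 1.60},
]


def column(rows: List[Row], variable: str) -> List[Any]:
    """Pull out one variable across all observations."""
    return [row[variable] for row in rows]


def is_rectangular(rows: List[Row]) -> bool:
    """True when every observation records exactly the same variables."""
    if not rows:
        return True
    expected = set(rows[0])
    return all(set(row) == expected for row in rows)


print(is_rectangular(observations))  # True
print(column(observations, "age"))   # [34, 51, 27]
```

Because every row has the same variables, any column can be extracted and analyzed with the same code, which is the practical payoff of keeping data in this shape.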

Felienne 00:12:43 Okay. I think that is clear. So now the concepts from statistics are a little bit more related to databases and things programmers might know. You said something really interesting. You said that you wanted the Tidyverse to have a small set of big ideas, which I think is a great way to design something. And I think we covered one of those big ideas, the data frame, and how you organize your data. Were there any other big ideas in your design?

Hadley Wickham 00:13:11 So I think another big idea is this idea of tidy evaluation. The way I understand tidy evaluation, if I can explain it briefly, is that it’s a blurring of the line between a variable in the data science, statistical sense, one of those columns in the data frame, and a variable in the computer science sense, something that exists that you can work with directly. The reason why you want to blur this line is that you often want to do similar operations on these statistics variables as you do on these computer science variables. You want to be able to say: I want to compute someone’s BMI by dividing their weight by their height squared. Now you’re kind of combining those two ideas of variables, saying I want to create a new column, a new variable, that is a function of two existing ones.

Hadley Wickham 00:14:11 And another thing that’s interesting about R, which I haven’t mentioned: in most programming languages, if you had multiple observations, if you had the weight and height for multiple people, you would have to write a for loop, so that you’d loop over every single person, get their weight, square their height, and divide the two. Whereas in R, you’d express that in a single calculation, because R is vectorized: rather than just dealing with single numbers, it deals with vectors or arrays of numbers. So to compute the BMI of any number of people, you just say weight,
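
The contrast between the explicit loop and the single vectorized calculation can be sketched as follows. In R the vectorized form is simply `weight / height^2`; plain Python lists are not vectorized, so this sketch defines a tiny, hypothetical `Vector` class to imitate elementwise arithmetic. The class, its method names, and the sample numbers are assumptions for illustration, and unlike R this sketch requires both vectors to have the same length.

```python
from typing import Callable, Iterable, List, Union

Number = Union[int, float]


class Vector:
    """A tiny elementwise vector (hypothetical; imitates R-style arithmetic)."""

    def __init__(self, values: Iterable[Number]) -> None:
        self.values: List[Number] = list(values)

    def _elementwise(self, other: Union["Vector", Number],
                     op: Callable[[Number, Number], Number]) -> "Vector":
        if isinstance(other, Vector):
            if len(other.values) != len(self.values):
                raise ValueError("vectors must have the same length in this sketch")
            return Vector(op(a, b) for a, b in zip(self.values, other.values))
        # A plain number is applied to every element.
        return Vector(op(a, other) for a in self.values)

    def __truediv__(self, other: Union["Vector", Number]) -> "Vector":
        return self._elementwise(other, lambda a, b: a / b)

    def __pow__(self, other: Union["Vector", Number]) -> "Vector":
        return self._elementwise(other, lambda a, b: a ** b)


weights_kg = [80.0, 50.0, 72.0]
heights_m = [2.0, 1.0, 1.5]

# Loop version: visit every person, square the height, divide.
bmi_loop = []
for w, h in zip(weights_kg, heights_m):
    bmi_loop.append(w / h ** 2)

# Vectorized version: one expression covers every observation.
bmi_vectorized = Vector(weights_kg) / Vector(heights_m) ** 2

print(bmi_loop)               # [20.0, 50.0, 32.0]
print(bmi_vectorized.values)  # [20.0, 50.0, 32.0]
```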

Felienne 00:14:46 And then it’s automatically applied to all the observations that you have. That brings me to another interesting question. What about scalability and performance? Is this something you look at at the programming language level? Because I can imagine that you define a simple function and then you might not realize that this is being calculated over everything in your dataset. Do you help people write efficient code? Is your own engine really efficient?

Hadley Wickham 00:15:11 Yeah. So I think in general, for myself personally at least, I spend most of my time thinking about the efficiency of the human. Most people are dealing with datasets that are small enough that the computational cost isn’t the bottleneck. The thing that you spend most of your time doing is thinking about what you need to do next with your data, and then how you express that. And so, you know, computational cost is certainly important, but it’s something that I generally spend a lot less time on. That’s sort of not my area of expertise. I care about how we make the human as efficient as possible. For me as a developer of R packages, that means I spend some time rewriting my R code into C and C++ for efficiency, and doing profiling and all the sorts of things that programmers do in every language to make the code fast. But by and large, that’s not something that most R users have to worry about, because again, they’re dealing with maybe tens or hundreds of thousands of observations, and just doing the most naive thing works fine with modern computers.

Felienne 00:16:20 Yeah. So you say performance issues are rarely a thing that you run into if you’re doing data analysis.

Hadley Wickham 00:16:26 I mean, maybe that’s a little bit strong. You probably still run into it a little bit, but it’s not the thing that dominates. Most of the problems are figuring out what you actually want to do. You spend minutes, hours, days thinking about that, whereas the computation might be seconds or a minute.

Felienne 00:16:47 So I want to zoom in a little bit more on the design decisions that went into making the Tidyverse. We talked about these big ideas, but I’m sure at some point there were also trade-offs, where you wanted to have two features but couldn’t have them both, or where one new language feature impacts another language feature. Can you give us an example of difficult trade-offs that you had to make in the design process?

Hadley Wickham 00:17:10 Yeah, I think the most common trade-off we battle with is this tension between experts and casual users of R. Because we, as the developers of the packages, you know, we live in R; we’re thinking about R 40 hours a week. It’s very easy to develop features that make us super effective, but that no one else can really understand. There have been a couple of recent examples of that which come to mind. The first is this idea of tidy evaluation, which I talked about briefly before, where we came up with this, I think, really good underlying theory. And we got really excited about it, and then we wanted to share it with the world. And frankly, I think the world was not ready for it, and the world

Hadley Wickham 00:18:04 did not really care about it, because it was this really advanced feature that we got excited about but, you know, most R developers didn’t. The other thing that’s come up recently comes back to this idea of vectorization: you want to do something to every observation without having to write some kind of for loop. R has a bunch of these vectorized operations, like all of the basic mathematics. If you want to add or divide or multiply, trigonometric functions, exponentials, logs, all of that is built in. But sometimes you get to a point where you have a function that’s not vectorized. So how do you apply that to every single observation? Maybe it’s something more complicated. Maybe you want to fit a model, or maybe your variable is a list of file paths, and now you want to go through every file path

Hadley Wickham 00:18:55 that’s a CSV file and read them all in one shot. I think we kind of keep bouncing back and forth, in some sense, between explicitness and magic. On the explicit side is teaching people about functional programming, because R is a functional programming language: how do you understand the ideas behind functional programming so you can use it effectively? Versus: how do we just give you some magic, so you don’t need to care about what’s going on behind the scenes? You don’t have to learn anything about functions and functions that work with functions, and you can just do what you want. And so recently, in the latest version of dplyr, which is the package for data manipulation, we have this new rowwise function, which kind of magically, automatically vectorizes things. You transform a data frame in such a way that every operation you perform on it is going to be done on each row

Hadley Wickham 00:19:52 individually; it kind of happens behind the scenes. And this is a tension where we’re thinking about what’s too magical. When does it work? Magic is great when it works, but when it fails, you’ve got no idea what went wrong and no idea how to resolve it. Versus: let’s teach you from the ground up what a function is, how a function works, what a higher-order function is and how that works, giving you a very detailed and correct model of the programming language, but now you’ve had to invest a lot of time in learning.
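
The "explicit" end of that spectrum, applying a non-vectorized function to every element with a higher-order function, can be sketched in Python. This is an analogy, not dplyr’s `rowwise()` and not code from the interview. The helper `read_csv_text` is hypothetical, and the file names and contents are made-up in-memory strings standing in for CSV files on disk so the example runs anywhere.

```python
import csv
import io
from typing import Dict, List

Row = Dict[str, str]


def read_csv_text(text: str) -> List[Row]:
    """Parse one CSV document into a list of row dicts (not vectorized)."""
    return list(csv.DictReader(io.StringIO(text)))


# Hypothetical stand-ins for a column of file paths and their contents.
sources = {
    "a.csv": "id,score\n1,10\n2,20\n",
    "b.csv": "id,score\n3,30\n",
}

# Explicit functional style: map the function over every input,
# then flatten the per-file tables into one table.
tables = map(read_csv_text, sources.values())
combined = [row for table in tables for row in table]

print(len(combined))         # 3
print(combined[2]["score"])  # 30
```

The cost of this style is that the reader has to know what `map` does; the benefit is that, when one file fails to parse, it is clear which call failed and why.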

Felienne 00:20:29 Yeah. I really liked your summary there, where you say it’s a trade-off. On one end of the spectrum, you have total magic, where you’re happy if it works but you don’t really know why. And on the other end of the spectrum, you have explicitness, where you have to really detail every step you take, which is more work, but then you also have a lot of control. I think that’s a struggle that many people who make software for other people also recognize.

Hadley Wickham 00:20:53 And this sort of comes back to this idea of: what are the big ideas? What are the things that we really believe you have to learn before you can do this stuff safely and correctly and efficiently? We want to keep that as small as possible, so you can get into it as quickly as possible, get some wins, and enjoy the process of learning. But at the same time, we don’t want to just say it’s a free-for-all. There is some important infrastructure, some foundation you need to build up over time, so you can have a good mental model of what’s going on and make better decisions in the future.

Felienne 00:21:29 Yeah. That totally makes sense. Have there been any features that you implemented and then later regretted, where you thought: oh, people don’t understand this, or they misuse it, this wasn’t the way I wanted people to apply things?

Hadley Wickham 00:21:42 Yeah. So over time, my criteria for the success of a system have changed. And now one of my criteria for success is that a package is not successful until someone has used it to commit academic fraud. I think that’s the fundamental challenge of developing general tools. If you want people to solve new problems with them, you have to accept that they might do things that you think are fundamentally wrong. There’s this trade-off between freedom and safety. If you constrain people to a single path where they can only do what you want, that’s pretty limiting, and there are lots of things that you didn’t think of that might be really important. Generally, I’m not too worried about that now. I think you just have to accept that people are going to do things that you don’t agree with, and you can’t control that at the software level. I think the best you can do is to control that at a social level, and, you know, be a positive role model and show people how to use these tools effectively and correctly. You can’t really prevent that in a fundamental way.

Felienne 00:22:56 Can you give an example of a feature that has been used for academic fraud, intentionally or accidentally?

Hadley Wickham 00:23:03 I forget the name of the guy, but he did this study where he faked a bunch of data for his PhD. He got a pretty famous paper out of it and made these most beautiful graphics using ggplot2. They were very, very compelling graphics, but they’re not really scientific graphics; they’re basically propaganda, because the underlying data was made up.

Felienne 00:23:29 Has there been any feature that you had to change?

Hadley Wickham 00:23:33 Yes, a lot. That’s something we spend a lot of time thinking about now: how do we change things I worked on before? I think the arc of tidyr is a good one to talk about, because I’ve been working on tools to help you tidy your data, so to go from whatever form you’ve currently got it in, into this form where every variable is a column and each row is an observation. That’s something that I’ve been working on for like 15 years now, and there have been, I think, four major changes in the way that I think about it. Over the course of 15 years, there’s been a package called reshape, another package called reshape2, another package called tidyr, and most recently two new functions in tidyr, pivot_longer and pivot_wider, which are a culmination of a bunch of changes and a bunch of feedback over time, with people saying they couldn’t understand the functions, or every time they used a function they had to look up the documentation. It just didn’t stick in people’s heads. And that’s one of the things that I love about Twitter, because you can get a stream of things that people would probably never mention to you in person, but will complain about.
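
The reshaping step described above, going from a "wide" layout to one where each row is an observation, can be sketched in plain Python. This is a simplified analogue written for illustration, not tidyr’s implementation: the function below and its parameters `id_cols`, `names_to`, and `values_to` are assumptions that only echo the general shape of a pivot-longer operation, and the country and year figures are invented.

```python
from typing import Any, Dict, List, Sequence

Row = Dict[str, Any]


def pivot_longer(rows: List[Row], id_cols: Sequence[str],
                 names_to: str, values_to: str) -> List[Row]:
    """Turn every non-id column into its own (name, value) row."""
    long_rows: List[Row] = []
    for row in rows:
        for col, value in row.items():
            if col in id_cols:
                continue
            new_row = {key: row[key] for key in id_cols}
            new_row[names_to] = col     # the old column header becomes data
            new_row[values_to] = value  # the old cell becomes one measurement
            long_rows.append(new_row)
    return long_rows


# Wide layout: the year is hidden in the column headers.
wide = [
    {"country": "A", "1999": 10, "2000": 12},
    {"country": "B", "1999": 7, "2000": 9},
]

long = pivot_longer(wide, id_cols=["country"], names_to="year", values_to="cases")

print(len(long))  # 4
print(long[0])    # {'country': 'A', 'year': '1999', 'cases': 10}
```

After the reshape, `year` is an ordinary variable that can be filtered or plotted like any other column, which is the point of tidying.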

Felienne 00:24:53 So would you say that’s the biggest reason to change a feature, that it’s hard for people to understand? Because I think that’s what you said there, that you had to change things because people would have to look up the syntax, or they would do it wrongly every time.

Hadley Wickham 00:25:05 Yeah. I think, particularly for what I care about, which is facilitating non-programmers doing data science, doing data analysis in a programming language, that’s really important. And this wasn’t only other people, either. This was me as well. I would use these functions, which I wrote some time ago, and I couldn’t remember how they worked, and I would have to read the documentation every time as well. At the time, it made total sense to me. I have this experience a lot: I’m programming something and think, wow, this is amazing, this totally makes sense, this is a great way to think about the problem. And then I come back to it three months later and I have literally no idea what I was thinking. And that’s very much the experience of a new programmer: they come to this absolutely fresh. They’ve got few preconceptions, and if it doesn’t stick in the brain, I think that can be a sign that there’s something wrong with the design of the function.

Felienne 00:25:59 Yeah. I’d love to hear more about how you get feedback. You just mentioned that people on Twitter complain, or maybe they send you compliments about things they like. I guess you also still use your own tools to do some data analysis. What are other ways that you get feedback to guide the direction of the language?

Hadley Wickham 00:26:17 Yeah, we do a surprisingly small amount of quantitative feedback. Relatively recently we’ve started doing a few more very casual surveys, which are mostly: just tweet out a question, point people to a Google Form, and get feedback on things, both in terms of trying to figure out what common problems people are facing, and to try out, say, four possible solutions and ask which one of those resonates with you. And this is always a little tricky, because we want to see what’s going to stick in your head in three months’ time, and measuring that is very, very tricky. The other thing that I think is really hard is that people don’t know what they need. It’s very difficult to recognize that there’s a bunch of things, maybe in disparate areas, where, if you had one tool or one other big idea, it would help you solve those problems.

Hadley Wickham 00:27:13 To me, that’s the thing that we are looking for: what are these big ideas? And, you know, people don’t know what they don’t know, and it’s very hard to articulate that. Some of what we do, I think, is just inventing words for new topics, because if you don’t have a name for a topic, it’s really hard to talk about. And this, I think, is one of the reasons the idea of tidy data has been so impactful: just having this name, tidy data. You can say: I’m tidying my data, or my data isn’t tidy, or this data is tidy. So one of the things that I think is really important is trying to discover these things that need names. And I have no idea how to do that except to talk to a bunch of different people and try to keep my eyes open for repeated patterns.

Felienne 00:28:01 Yeah. That’s really interesting, because we say you do data analysis, but that’s still relatively broad. I can imagine that you probably have different types of users; maybe a biologist doing data analysis is different from a linguist or from a computer scientist. Is this something you see, that different subfields of science have different usage patterns for data analysis and for the Tidyverse?

Hadley Wickham 00:28:25 I mean, I think there’s a huge amount of commonality, particularly at the early phases, which we focus on: getting the data in, getting it tidy, and doing a bunch of visualizations so you can figure out what’s going on. That’s hugely similar across fields. That said, people in different fields give the same things different names and tackle things in slightly different ways. But I think that overall there’s a huge amount of commonality. Things tend to get a little bit different in terms of the details: precisely what visualizations are you going to put in a paper, or what models are you going to use? There’s a lot more variation in the customs of academics. That tends to be something that I don’t spend so much time thinking about.

Felienne 00:29:11 So what every scientist does in terms of data analysis is similar enough that you don’t need separate packages for being a biologist or a computer scientist.

Hadley Wickham 00:29:22 I think for some parts you do, because you’re going to deal with certain types of data file that only biologists deal with, not computer scientists. And if you’re publishing a paper, there are models that people are going to expect you to use, or that the reviewers are going to understand and that will be easy to sell. So certainly, maybe 20% of the analysis is very specific to a given domain, and every domain needs a bunch of different specialized tools for that. But there is a solid 80% at the heart of it that is the same. That’s just about data and working with data, and it doesn’t matter what field you’re in.

Felienne 00:30:02 And how do the Tidyverse and R compare to other tools for data analysis? Of course, if I think about data analysis, I think of something like Excel, or maybe in Python you have pandas, which I think shares some of the big ideas of the Tidyverse in terms of data frames and missing values and things like that. So how would you say R compares to alternatives for data analysis?

Hadley Wickham 00:30:26 Yeah. So you kind of identified the two big families of alternatives, which are point-and-click GUIs and programming languages. I think if you want to call yourself a data scientist, one of the distinctions between data science and data analysis generally is that data science is done in a programming language. I think programming languages have this kind of immense advantage, which is that it’s just text: you represent the sequence of operations you’re performing in text, and text is amazingly powerful because you can Google it, you can copy and paste it, you can email it, you can turn your problem into something that you put in an email and send somewhere else. And then of course it’s all rerunnable and reproducible. Now, of course, there’s also a big disadvantage to programming languages.

Hadley Wickham 00:31:20 Well, the big advantage of a GUI is that everything is spelled out in front of you. You’ve got all these menus and these buttons, and maybe you can’t remember what they all do, but you can search through them because all of your options are laid out in front of you, whereas with a programming language you have to remember what you can do and string all the pieces together. So GUIs are, I think, great for casual users, for people that aren’t doing data science enough to remember what all the commands are. But I think using programming languages for data is kind of one of the key distinguishing features of data science.

Felienne 00:31:52 Yeah. Because they are text-based and that helps you do diffing and storing and reproducibility.
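
To make the point about text concrete, here is a minimal sketch in Python (not R, and not anything shown in the episode) of how two versions of an analysis written as plain text can be compared line by line. The step descriptions and file names are invented for illustration; only the standard-library `difflib` module is used.

```python
import difflib

# Two hypothetical versions of the same analysis, written down as plain text.
analysis_v1 = [
    "load survey data\n",
    "drop rows with missing values\n",
    "plot height against weight\n",
]
analysis_v2 = [
    "load survey data\n",
    "impute missing values with the column median\n",
    "plot height against weight\n",
]

# Because both versions are text, a standard diff shows exactly what changed.
diff = list(
    difflib.unified_diff(
        analysis_v1,
        analysis_v2,
        fromfile="analysis_v1.txt",  # illustrative names, not real files
        tofile="analysis_v2.txt",
    )
)
print("".join(diff))
```

The same property is what lets version control systems track an analysis script over time, which is much harder to do for a sequence of mouse clicks.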

Hadley Wickham 00:31:58 I mean, and then copying and pasting from Stack Overflow, one of the most pragmatic techniques.

Felienne 00:32:05 And then I think, again, there’s this trade-off between explicitness and magic, in a certain sense, where a GUI has lots of magic: it does a lot for you without you maybe knowing a lot, whereas in text you can be really explicit, which is nice from a...

Hadley Wickham 00:32:18 Reproducibility perspective. Yeah. And then the downside of the GUI, too, is that you’re fundamentally constrained by what the author of the GUI wanted to allow you to do. And if you want to break out of that and do something different, that tends to be very, very difficult.

Felienne 00:32:33 Yeah. Yeah. Then you would probably still have to script it yourself using a programming language anyway.

Hadley Wickham 00:32:40 I mean, Excel is kind of an interesting case because it’s not just a GUI; it’s really a programming language. And I remember when I was, I guess I would’ve been in high school, someone showed me a rotating 3D plot that they’d made in Excel, not using Visual Basic or anything, but just constructing in-cell formulas to do all of the trigonometry to rotate the cloud of points in 3D, and my mind was just blown.

Felienne 00:33:11 Yeah. It is pretty amazing what you can do with Excel, but indeed it does suffer from some of the limitations where it’s harder to version control. For example, it’s harder to email your calculations to someone.

Hadley Wickham 00:33:22 Yeah. I also remember when I briefly taught a class that covered R, Excel, and SAS. SAS is one of the older, very specialized programming languages specifically for statistics. And one of the things that I found really interesting is that it was actually quite difficult for me to teach, or to show people what I was doing, in Excel, because there’s a bunch of things where a tiny difference in what you do with the mouse makes a big difference. Like, are you right-clicking, or are you left-clicking, or are you double-clicking? Sometimes when you’re dragging certain things, just being off by a pixel or two gives you a completely different result. That sort of opened my eyes: I’d been a big Excel user for a long time, but just trying to show other people what I was doing was really, really difficult. Whereas with R, I could always copy and paste it and put it in a file, and you could run it. You could do exactly the same thing I did on your computer.

Felienne 00:34:21 Yeah. And given the reproducibility crisis in science, where indeed sometimes it might be people that accidentally just forgot how their spreadsheet works because it was years ago, but sometimes also [unclear], this does seem like a very important feature nowadays.

Hadley Wickham 00:34:36 Yeah. I think the other thing that’s sort of interesting, and I guess there’s one other style of programming which has its proponents in data science, is this sort of drag-and-drop, component-based visual programming style, where you have a bunch of boxes, each representing an operation, and you connect them together with lines. And I think that has a lot of the weaknesses of programming without having all the strengths. You’re given this tool that can do anything, you’re just creating a graph of arbitrary complexity, but you no longer have all of these tools that programmers have around diffing and debugging and emailing people to get help and things like that. Because the communication side of code is really, really great.

Felienne 00:35:22 I also want to zoom in a little bit on the technical side of R first. So how does it work technically? Is it a compiled language or an interpreted language?

Hadley Wickham 00:35:31 So it’s an interpreted language. Although, as package developers, and package developers in general, we also write C: under the hood, R is written in C, and so you can also write C code if you want to switch into a language designed for much higher performance. But yeah, it’s fundamentally an interpreted, interactive language, really designed around the REPL, so you ask a question of your data and get your answer back as quickly as possible.

Felienne 00:36:01 And if you would write extensions or modules, would you typically then write R modules in R, or do people then fall back to C?

Hadley Wickham 00:36:09 In the R community we call them packages instead of libraries or modules. So the vast majority of packages are written in R. And I think the two main reasons to write them in C are either performance or because you’re connecting to some other pre-existing software library that’s been written in C. For example, a package that works with XML data is basically just a binding to libxml2. So it just provides an interface so the user can write R code that calls C code written by other people to do all the usual things you want to do.

Felienne 00:36:47 Yeah. That’s an interesting point. That brings me to my next question: how does R interoperate with other systems? You just said, oh, you can load XML data. Can you also embed it with other programming languages? Like, can I have some statistical analysis in R and the front end in JavaScript or something like that?

Hadley Wickham 00:37:05 Yeah, I think so. Historically, R was this kind of glue language between C and Fortran, and that’s, I think, a really big part of R’s DNA. It comes back to your question about performance: in some ways, the point of R is not to be a really fast programming language, but it should be able to talk to really fast programming languages really easily. So obviously it has really strong connections for data, like to databases and XML and JSON, and really strong connections to C and Fortran from its statistical history. More recently there are connections in both directions to Python. Some people have experimented with trying to use Rust to provide high-performance packages. And then of course there are a bunch of connections to JavaScript, mostly from R to JavaScript. One of the things that my colleagues at RStudio work on is this tool called Shiny, which makes it really easy for an R user to basically make a web app. And then all of the JavaScript, you know, there’s not a lot of communication with JavaScript.

Felienne 00:38:13 So then that would allow you to create maybe an interactive visualization on a website based on data analysis done in R?

Hadley Wickham 00:38:21 Yeah, exactly. The other thing we see it used for a lot is this: historically, if you were producing a report for someone, you would want to make sure you’d produced every possible graph that they might ask for. And so you’d just give them this 800-page PDF with every possible graph they might’ve asked for, which is obviously not a wonderful user experience. And so now we’re seeing people transition to using Shiny apps, where, as a data scientist, you can give the person who needs to understand the data and make decisions based on it some kind of limited analysis abilities, so they can try out a few different scenarios. They can ask different questions of the data.

Felienne 00:39:02 So that is the scenario where you have data analysis in R, and then you project it to another programming language like JavaScript, correct? What about the reverse, where you build maybe something like a web app that’s collecting data, that’s running your experiments, and then can you get your data back into R and run a live data analysis?

Hadley Wickham 00:39:20 Yes. So one of the kinds of live data that I’m most familiar with, and that we use a lot internally, is Google Sheets. So we’ll collect data using a Google Form, which goes into a Google Sheet, or maybe we’ll be doing data entry in Google Sheets directly because it’s such a fantastic collaborative environment. So, for example, we’re working on rstudio::global, an online conference coming up soon. In the program committee, we use Google Sheets a lot because we can have a meeting and then have three of us entering data into a Google Sheet. And then we have an R script that pulls that into R and does some basic sanity checking: have we scheduled multiple things at the same time? Have we made one person give seven talks? That kind of stuff. And that’s just a really nice pattern: using Google Sheets for what it’s really good at and using R for what it’s really good at, and having this live update where we can easily see what’s happened after you’ve made some changes.
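
The kind of sanity checking described here can be sketched in a few lines. The example below is an illustration in Python rather than the R script mentioned in the interview, and it works on hard-coded rows instead of a real Google Sheet; the column names, talk titles, and the `max_talks` threshold are all assumptions made up for the sketch.

```python
from collections import Counter, defaultdict

# Hypothetical rows standing in for data pulled from a shared spreadsheet.
schedule = [
    {"talk": "Intro to tidy data", "speaker": "Ana", "room": "A", "slot": "09:00"},
    {"talk": "Plotting basics", "speaker": "Ben", "room": "A", "slot": "09:00"},
    {"talk": "Modeling 101", "speaker": "Ana", "room": "B", "slot": "10:00"},
    {"talk": "Reporting", "speaker": "Ana", "room": "A", "slot": "11:00"},
]


def double_booked(rows):
    """Return {(room, slot): [talks]} for each room and slot holding more than one talk."""
    by_slot = defaultdict(list)
    for row in rows:
        by_slot[(row["room"], row["slot"])].append(row["talk"])
    return {key: talks for key, talks in by_slot.items() if len(talks) > 1}


def overloaded_speakers(rows, max_talks=2):
    """Return {speaker: count} for each speaker with more than max_talks talks."""
    counts = Counter(row["speaker"] for row in rows)
    return {speaker: n for speaker, n in counts.items() if n > max_talks}


conflicts = double_booked(schedule)
busy = overloaded_speakers(schedule, max_talks=2)
print(conflicts)
print(busy)
```

Re-running checks like these after every round of edits is what gives the quick feedback described above: the spreadsheet stays the place for collaborative entry, and the script reports problems.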

Felienne 00:40:22 And that’s certainly that glue functionality that you talked about before, where R is the glue between something that collects data and someone that wants to look at data. So what are your big plans for the future? What features are you working on? What trends are you excited about?

Hadley Wickham 00:40:40 I think generally it feels like we have this kind of punctuated equilibrium, where we go through periods of rapid, moderately large change, and then periods where we just let the dust settle and let people learn about what’s changed recently. And it feels like we’re coming out of a period of a bunch of change and into a period where we are consolidating, where we’re improving our educational materials, where we’re making sure we smooth the rough or sharp corners off all of our functions so they’re as easy as possible to use. And again, this is one of the things that I rely on Twitter for: just in general, how is the community feeling about change right now? Do people want big new things that they’re going to have to learn about, that might make their lives better but are going to require some investment, or do they want to stay where they basically are? And I think the message we’re getting at the moment, with lots of change going on elsewhere in the world, is that people basically just want to stay where they are and have the rough corners polished down: everything fitting together a little more tightly, better error messages, all that kind of stuff.

Felienne 00:42:03 Are there features in other programming languages or other IDEs or programming systems that you’re jealous about? Where you’re like, oh, I wish we could also have that, or I wish I had the time to implement that idea in the Tidyverse?

Hadley Wickham 00:42:15 On the very, very big side of things, things that would take us a year or, I guess, multiple years if we ever did start: I love the idea of type inference and bringing in gradual typing, in the way that PHP and Ruby and other languages have. So you can start to be a little bit more explicit about what types of data you expect your function to take in and to produce. And that has a bunch of benefits, from performance to documentation to error messages to testing. That just seems like something I would love. Haskell has, or Rust has, stuff like that that allows you to make it more explicit and more clear what the inputs and outputs of your function are.
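
As a rough illustration of what gradual typing looks like in practice, the sketch below uses Python’s optional type hints, since R has no such feature built in. The function names are invented for the example, and `None` stands in for a missing value; annotated and unannotated functions live side by side, which is the "gradual" part.

```python
from typing import Optional, Sequence, get_type_hints


def mean_ignoring_missing(values: Sequence[Optional[float]]) -> Optional[float]:
    """Mean of the non-missing values; None plays the role of a missing value."""
    present = [v for v in values if v is not None]
    if not present:
        return None
    return sum(present) / len(present)


def scale(values, factor):
    """An unannotated function: it still runs, it just documents less."""
    return [None if v is None else v * factor for v in values]


# The annotations are available to tools (checkers, documentation generators).
hints = get_type_hints(mean_ignoring_missing)
result = mean_ignoring_missing([1.0, None, 3.0])
print(sorted(hints))
print(result)
```

In Python the interpreter does not enforce these annotations at run time; a separate type checker reads them. How such a scheme would fit R’s missing values is exactly the open question raised in the next exchange.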

Felienne 00:42:59 Do you not implement that because it’s too much work, or because it doesn’t really fit the language? I can imagine that if you have this missing data type, that makes type inference really hard, if that’s allowed and we need some sort of new type that overarches all the other types.

Hadley Wickham 00:43:15 Yeah. I think the biggest thing is that it’s clearly a large amount of work and it’s not clear exactly what the payoff is. I think it would be really cool, but would it actually help enough people to justify spending probably two or three programmer-years’ worth of work on it? And you know, this is, in general, one of my personal challenges: there’s lots of really cool programming things that I love and think are really awesome and amazing, and that would be fascinating to explore in more detail. I’m not sure anyone else is really going to care about them that much, or whether they would actually have a positive impact on people’s lives.

Felienne 00:43:51 It is hard to know, of course, if a new feature really makes something better. Like you just said, better error messages: what is better? More detail might help some programmers, but might actually harm the productivity of other developers, because then they have to read the entire error message, which might not be so useful. So how do you know if your language feature is better? Is it just gut feeling? Because you said before that you didn’t really do any field surveys.

Hadley Wickham 00:44:18 Yeah. I mean, mostly I think it’s a combination of gut feeling, of trying it out and seeing how it feels to us as data analysts, and then, you know, bringing it to the community and saying, how does this feel to you? But even if we did invest significantly, it’s not clear to me how you would even measure the things that we care about. At least, I’ve realized one metric that I care about is time spent in flow state. I think in some ways that’s the key metric I would love to optimize. I want you to be in a flow state where you are getting code out of your brain, your fingers are just typing the code, you have to think about it relatively little, you’re fluid, you’re not searching your brain. That’s what I want to optimize. How on earth do you measure that? And it’s not just in the moment; it’s a combination of how long it takes to learn it all and to get into that place where you can do this. It’s really, really, really hard.

Felienne 00:45:29 And I think that’s sometimes what programming language designers call the language getting out of the way: it just allows you to program, and the language isn’t giving you annoying pop-ups like, oh, you know, this type doesn’t match what you’re doing. Really, it’s almost like the programming language isn’t there. You can just express your thoughts directly. But that is really hard to achieve, of course, in programming.

Hadley Wickham 00:45:50 Yeah. But at the same time, the programming language also has to help shape your thinking in a way that guides you naturally towards a path that’s more likely to be successful. And again, I just have no idea. I mean, how you do that is really, really tough. It really feels like an art or a craft, not a science, right now.

Felienne 00:46:13 Yeah. And even though you say that you don’t really know how to do it, it does seem like you’re doing really well, because the Tidyverse is a really popular R package and it has supported many statisticians and also other scientists over the years in doing lots of data analysis.

Hadley Wickham 00:46:27 Yeah. I mean, I think we’ve been pretty successful. I think we’ve managed to do quite a good job. I don’t know how we do it. I think we’ve done it successfully enough times that it’s not just luck, but I don’t know exactly what it is about how we think about this problem. What are the skills that we bring to bear that are really the most useful?

Felienne 00:46:45 Yeah. It might be a combination of experience. You were a scientist yourself, really doing a PhD, doing data analysis, and then getting to know the problem space really well, and then designing something. That does seem to be key, I think.

Hadley Wickham 00:46:58 Yeah. The other thing that’s interesting to me, as I get more senior in age and as new folks join my team, is that it’s clear there are people on my team and in the community and in the world whose raw programming power is much greater than mine. They can churn out code much faster than I can. But somehow the skill that I have is seeing two or three steps down the road and being like, oh, these two different things are going to have this kind of conflict; you’re going to run into problems in a month’s time if you continue down this path. And it’s just really interesting to see where your skills are when you work closely with other people who are very, very, very talented.

Felienne 00:47:46 Yeah. And I would say that’s probably due to your experience in working with these types of systems and having developed them: you don’t need to be the most efficient programmer, but if you can indeed envision what features people might like and how they might combine together, then that is very valuable.

Hadley Wickham 00:48:02 Yeah. Which is not to say that I don’t still spend a lot of time on dead ends and writing a bunch of code that I eventually throw out. But I think that’s the other thing I’ve become more comfortable with over time: just accepting that creating something good is often about throwing out stuff, throwing out things that you’ve spent potentially a lot of time on, whether that’s code or writing or, I think, even baking. When people present these beautiful trays of cookies, there’s also another tray of really ugly cookies that you never get to see.

Felienne 00:48:38 Yeah. And probably also like a hundred trays of cookies in their past that they baked, each only just a little bit better.

Hadley Wickham 00:48:44 Yeah, exactly. And it’s very hard to see that from the outside. You look at someone and you see all the things that they’ve been successful at and presented to the public. You don’t see all the things that they’ve tried and the experiments that did not succeed.

Felienne 00:49:03 Yeah. That’s true. That was all I wanted to ask. Is there anything that we missed? Anything where you say, oh, I really want to share this fact about the Tidyverse?

Hadley Wickham 00:49:11 I don’t think so. I mean, the main thing is, if you’re interested in learning more, the Tidyverse has a website, of course. And there’s also a book that I wrote with Garrett Grolemund called R for Data Science, which really aims to take someone who potentially hasn’t programmed at all and help them become a competent, happy data scientist. That’s available online for free. You can just Google “R for Data Science” and get the details there.

Felienne 00:49:38 We’ll put the link to R and the Tidyverse and also to your book in the show notes, so people can easily find them. Is there anywhere else we can find you on the web? Are you on Twitter, or do you...

Hadley Wickham 00:49:48 Yeah, Twitter. I’m mostly on Twitter, and I’m probably spending too much time on Twitter. I don’t have a personal blog, but we have a team Tidyverse blog at tidyverse.org; that’s the best place. I think that’s mostly of interest if you use the Tidyverse, because that’s where we post all the release announcements. That’s where we post important things like, hey, we’re thinking about making this really big change, please give us feedback. So I think that’s a great place to hang out if you’re on social media.

Felienne 00:50:20 Great. So we’ll make sure we also link to the Tidyverse blog and to your Twitter account in the show notes. Thank you so much for being on the show.

Hadley Wickham 00:50:28 Thanks Felienne. Thanks for having me.

[End of Audio]

SE Radio theme: “Broken Reality” by Kevin MacLeod (incompetech.com — Licensed under Creative Commons: By Attribution 3.0)
