
SE Radio 533: Eddie Aftandilian on GitHub Copilot

Eddie Aftandilian, Principal Researcher at GitHub Copilot, speaks with SE Radio's Priyanka Raghavan about how GitHub Copilot can improve developer productivity as it is integrated with IDEs. They trace the origins of developer productivity tools, from integrated development environments to AI-powered buddies such as GitHub Copilot. The episode then takes a deep dive into the workings of Copilot, including how the Codex model works, how the model can be trained on feedback, the model's performance, and the metrics used to measure the code that Copilot produces. The show also explores some examples of where Copilot could be useful — for example, as a training tool. Priyanka asks Aftandilian to respond to negative feedback that has been directed toward GitHub Copilot, including a paper asserting that it might suggest insecure code, as well as allegations of code laundering and privacy issues. Finally, they end with some questions on the future directions of Copilot.


Transcript

Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number and URL.

Priyanka Raghavan 00:00:17 Hi everyone, this is Priyanka Raghavan for Software Engineering Radio, and today we're going to be discussing GitHub Copilot and how it can improve developer productivity. For this, our guest is Eddie Aftandilian, who works as a researcher at GitHub. Eddie received a PhD in Computer Science from Tufts University, where he worked on dynamic analysis tools for Java. He then went on to Google, where he again worked on Java and developer tools, and of course he's now a researcher at GitHub working on developer tools for GitHub Copilot, which is an AI-powered code-generation tool integrated into VS Code. In addition to working on the Copilot VS Code plugin, he also works closely with OpenAI and Microsoft Research to improve the underlying Codex model. So you're a perfect guest for the show, and welcome to the show, Eddie.

Eddie Aftandilian 00:01:13 Thank you. I’m very excited to be here.

Priyanka Raghavan 00:01:15 Okay, is there anything else you would like listeners to know about yourself before we jump into Copilot?

Eddie Aftandilian 00:01:21 So, as you mentioned, my background has been in various types of developer tools: dynamic analysis, and static analysis tools at Google. And so, I have a soft spot especially for static analysis and detecting common problems as part of the developer workflow, and helping developers write better code in that way as well.

Priyanka Raghavan 00:01:43 That's great, because the first question I wanted to ask you, before we actually go into Copilot, considering your background: we've had the days of vi, and then we've had the days of Vim, and then of course it got better with Emacs (probably showing my age now), and then we've had IDEs, from Eclipse to VS Code to Sublime Text to IntelliJ. What do you think about the integrated development environment? How has it really contributed to, say, developer productivity?

Eddie Aftandilian 00:02:10 I think IDEs have contributed greatly to developer productivity. So, when I started programming in college, we all used Vim, and I actually still use Vim today for certain tasks, but when I need to do anything more substantial, I use an IDE. These days it's usually VS Code. When I was writing Java, it was IntelliJ, and before that it was Eclipse. I find it very helpful to be able to do things like jump to definition, find usages of symbols — these kinds of things, and autocomplete is a big help. Especially things like refactorings and the built-in warnings and static analysis are a huge help to me. I'm a big fan of IDEs. I think IntelliJ is particularly impressive. I think they do a really, really good job with their refactorings and static analysis, and honestly, when I'm trying to do more substantial coding work, if I'm not using an IDE, it kind of feels like I'm trying to work with one hand tied behind my back. I depend heavily on IDEs these days.

Priyanka Raghavan 00:03:11 Okay, that's great. The next question I wanted to ask you: from IDEs, we've had this area of research called code generation, or code generators. In Software Engineering Radio, for example, we've done shows on model-driven architectures and model-driven code. We recently had Episode 517, where another host talked about code generators, basically how UML specifications or OpenAPI specifications could be converted into code. And I was wondering, this idea of an AI-powered buddy, did that all come from this area of research, which is, yeah, code generation?

Eddie Aftandilian 00:03:47 I can't say it did. I can see the connection, but from my perspective the idea behind Copilot came from a combination of the existing autocomplete that you see in IDEs, combined with the emerging capabilities of machine learning models. In my time at Google — so Google has this giant monolithic code base, and it has a very nice code search tool that helps you find code and has IDE-like features that let you jump to the definitions of symbols and see all the usages of symbols. And one thing I observed at Google was that almost any time I was writing a piece of code, someone had probably written the same code somewhere else in the Google monorepo. And so, I was spending most of my time looking through code search and trying to find examples of where other people had done the same thing, that I could use as a template for what I was trying to do.

Eddie Aftandilian 00:04:40 And from there it seemed pretty plausible that a machine learning model could be trained on this type of data and learn those patterns, and then the human no longer has to go search for these things; the model can bring you the examples and adapt them to your context in a much quicker way that doesn't take you out of your flow. So, from my perspective, that's where this idea came from. But these types of ideas tend to form simultaneously in a bunch of different teams, so other people may have come at this from different directions and ended up in the same place.

Priyanka Raghavan 00:05:11 Since we have an expert on the show, coming from that idea, there's another one that I keep seeing in the literature whenever you Google search Copilot: it's called GPT, or the generative pre-trained transformer. What is that? Could you explain that to our listeners?

Eddie Aftandilian 00:05:26 Sure. So GPT is the name for the natural language models that are produced by OpenAI, who are our partners on Copilot. Generative means that they generate text; they generate the next token in a sequence. So you give them a bunch of text and they try to predict what comes next. Pre-trained means that the model comes trained out of the box on kind of a general task, this task of predicting the next token, but it can also be adapted to other tasks. Sometimes you can just give it examples of what you want it to do that are slightly different from what it was pre-trained to do, and it will do them. And sometimes you fine-tune the model for a slightly different task by continuing training on a slightly different data set where the target task is a bit different. And transformer refers to the architecture of these models. The transformer is kind of the standard architecture these days for large language models. They were introduced in a very influential paper from 2017 by a number of Google researchers, and transformers have become kind of the dominant way of constructing these large language models.

Priyanka Raghavan 00:06:40 Very interesting. We'll probably deep dive into this in the next section, but before we take a deeper dive into Copilot, could you give us a little more context on the exact problem that Copilot is trying to solve? Would you say it is developer productivity, or could it be a training tool for learning a new language?

Eddie Aftandilian 00:07:01 I think it could be any of those things. I think the core goal is to suggest code to the user that the user finds helpful for whatever reason. Maybe they find it helpful because it accelerates their coding, or it keeps them in the flow so they don't have to switch off to do a search or go look on Stack Overflow; the help is right there in their IDE. It might be that it gives you a skeleton of how to accomplish the task that you're trying to do, and you have to adapt it a bit, but having the skeleton is helpful. And it also could be that it's helpful when you're learning a new programming language, when you don't know the idioms. Maybe you're an experienced programmer but you don't know how a particular task is accomplished in a different programming language, but you know how you would do it in your native programming language. I think Copilot can be helpful for all those things.

Priyanka Raghavan 00:07:49 Yeah, I can especially remember when I started programming in Python some time back; I had a big problem going from, say, Java or C# to Python because it's like, where are the types, where are my semicolons? So maybe an AI-powered buddy would've helped. And the last question I want to ask you before we move on to the next part: how long was Copilot a research project, and when did you decide to actually release it, first to a select set of users, and now to its current state where you're actually charging for it? Could you tell us a little bit about that?

Eddie Aftandilian 00:08:19 Yeah, of course. So to my understanding (and I wasn't at GitHub yet at this time), Copilot started sometime in 2020 as a collaboration between GitHub and OpenAI. By the time I joined the team in March 2021, Copilot was a prototype, and we released it as a technical preview to the public in June 2021. And then just this past June 2022, we made it generally available to developers. In the technical preview phase we had a wait list and people had to apply to use it; now anyone can use it. There's a free trial, and if you want to continue after the free trial, it's $10 a month.

Priyanka Raghavan 00:08:58 Okay, that's great. So now that we're done with a bit of introduction to Copilot, I want to take a deeper dive into its workings. Could you explain to us how Copilot works? Also, could you touch upon a few of the things that our software engineers would be interested in? For example, how do you get such good performance, considering you're crunching code from a lot of sources like public repos?

Eddie Aftandilian 00:09:25 At a core level, the way that Copilot works is that there's an underlying machine learning model. It's called Codex; it's related to GPT-3. So we talked about GPT models before; it's produced by OpenAI. It's focused on generating code, as opposed to natural language, which is what the GPT-2 and GPT-3 models generate. The way that these models work is that you give the model a prompt, and the model predicts what should come next. It predicts the next chunk of text, and under the covers it produces, let's say, a word or a token at a time, and then you form that into a longer sequence based on probabilities and such. You can ask it to generate a sequence of tokens up to a certain length; that's a property of the model. So, in Copilot we connect up to the model by collecting context from the user's IDE that we use to construct a prompt, and then we pass that to the Codex model.

Eddie Aftandilian 00:10:25 And sort of the simplest way that you might do this is: imagine you're editing some file in your IDE and your cursor is at some point, let's say in the middle of the file. You could construct a prompt by just taking the content of the file from the start up to where the cursor is, and then the model will predict what comes next. The way we do it is more complicated than that, but that's kind of the baseline; that's the simplest thing you could do that would produce reasonable results. Let's see: when the model produces a suggestion, we display it to the user in the IDE, and we display it in light-colored text; we call it ghost text. The user can either hit tab to accept it, just like normal autocomplete, or they can keep typing to implicitly reject it.
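As a rough illustration of that baseline, here is a minimal sketch of prefix-based prompt construction. The names and structure are hypothetical, invented for illustration; this is not Copilot's actual code.

```typescript
// Minimal sketch of the baseline prompt construction described above.
// All names here are hypothetical; Copilot's real logic is more involved.
interface EditorDocument {
  text: string;         // full contents of the file being edited
  cursorOffset: number; // character offset of the cursor
}

function buildPrompt(doc: EditorDocument, maxChars: number): string {
  // Take everything from the start of the file up to the cursor...
  const prefix = doc.text.slice(0, doc.cursorOffset);
  // ...and truncate from the left so the prompt fits the model's
  // context window, keeping the text nearest the cursor.
  return prefix.length > maxChars ? prefix.slice(-maxChars) : prefix;
}
```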

Eddie Aftandilian 00:11:13 In terms of how we get such good performance, one thing about the architecture here is that the underlying Codex model is a very large model; it's not feasible to run it locally on a user's machine. So we run these models in the cloud; we run them on Azure machines with very powerful GPUs. Some of the performance we get is because of the level of hardware that we're able to use. Part of the performance here is just very strong performance-tuning engineering from both OpenAI and our partners at Azure. They put a lot of effort into optimizing these models and making them run fast, so that people get reasonable completion times, less than half a second, in their IDE when they're using Copilot.

Priyanka Raghavan 00:11:53 I can vouch for that. I've been using it a few times, and yeah, it's been great that way. Just to follow up on that, one thing that struck me was when you talked about the context of the code base. You did allude to the fact that it looks at the file up to the part where the cursor is, but does it also look at the Git history of that file, or the whole tree structure? Is it only the file, or the whole tree structure of the project?

Eddie Aftandilian 00:12:17 It doesn’t look at Git history, it doesn’t look at tree structure. It does look at context from other files that are open in the editor. So, imagine you have multiple windows and you’re flipping back and forth. There’s a good chance that the files you’re flipping back and forth between are relevant to whatever task you’re currently trying to accomplish. And so, we inline snippets from other files that are open in the editor into the prompt and we actually see quite a large performance boost from doing that.
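Extending the earlier sketch, a hypothetical version that inlines snippets from other open tabs might look like the following. Again, these are invented names, not GitHub's implementation.

```typescript
// Hypothetical sketch: prepend snippets from other open editor tabs,
// marked as comments, ahead of the current file's prefix.
interface NeighborFile {
  path: string;
  snippet: string;
}

function buildPromptWithNeighbors(
  neighbors: NeighborFile[],
  prefix: string,
  maxChars: number
): string {
  const neighborBlock = neighbors
    .map((f) => `// Snippet from ${f.path}:\n${f.snippet}`)
    .join("\n");
  const prompt = `${neighborBlock}\n${prefix}`;
  // Truncate from the left: neighbor snippets are dropped before the
  // cursor-adjacent text, which matters most to the prediction.
  return prompt.length > maxChars ? prompt.slice(-maxChars) : prompt;
}
```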

Priyanka Raghavan 00:12:47 Okay. So that you can, yeah, be predictive, considering that you might switch to the other window. Okay, cool.

Eddie Aftandilian 00:12:53 Right. Like, imagine you're writing code and you're doing this thing that I described earlier: you're looking for other examples of how to do whatever task you're trying to accomplish, but you're looking in your local project. I think that's a pretty common thing that people do. So you can imagine that whatever you're looking at in the other window is probably pretty relevant to the thing you're trying to do in the current file, even though that's not the file you're working on.

Priyanka Raghavan 00:13:15 Okay, gotcha. The other question I wanted to ask is, would Copilot work differently if you were an English speaker versus if you were not one? Is there an advantage to being an English speaker?

Eddie Aftandilian 00:13:27 So, this is a good question that we’re actively investigating, but I don’t have an answer for you yet.

Priyanka Raghavan 00:13:34 Okay. Then I guess the other thing I would ask: I was following the Copilot Twitter handle as well as your Twitter handle, and one of the things I remember from your tweets some time back was that you'd said you'd used Copilot to build Copilot. So can you elaborate a bit on that? How did that work out?

Eddie Aftandilian 00:13:51 Yeah, so I mentioned that when I arrived, Copilot was a prototype. It was already a VS Code extension. Those of us who worked on Copilot all used that extension to further work on Copilot. So, in some sense Copilot helped write itself. I found it very helpful. You asked a question earlier, or you alluded to Copilot being helpful when you're learning a new language. That was what I did when I joined the Copilot team. I had been primarily a Java developer for the last 10 years, and Copilot is written in TypeScript, and then we have other code bases that are primarily Python. I'd never written any TypeScript, and I'd only written a small amount of Python, and I found Copilot very helpful in helping me ramp up quickly and write production-quality code in these new languages.

Eddie Aftandilian 00:14:43 I think the neatest thing was that it would teach me aspects of these languages that I hadn't seen before. So, one anecdote here: at some point in Copilot I was writing some code to take options from, I don't know, some arguments to a function or something, and then merge them with a default set of options in this options class, and Copilot suggested that I wrap the option type in this Partial type that's in TypeScript. And what Partial does is it takes properties that are required on a type and makes them all optional. And I guess the pattern of how you do this option merging in TypeScript is you have a fully formed options object, and you take a partial object and kind of just lay it on top of that and override the default values, and you produce a fully constructed options object with all the required properties there. But I had never heard of this Partial type, I had never seen an equivalent in another programming language, and so I had to go off and Google what Partial was, but it was exactly what I needed there, and also kind of the idiomatic way to do this in TypeScript. Copilot taught me this tidbit that I don't know how I would've learned otherwise.
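For readers who, like Eddie then, haven't met Partial before, a minimal example of the merging idiom he describes might look like this (the option names are invented for illustration):

```typescript
// The TypeScript options-merging idiom described above.
interface Options {
  timeoutMs: number;
  retries: number;
  verbose: boolean;
}

const defaults: Options = { timeoutMs: 5000, retries: 3, verbose: false };

// Partial<Options> makes every required property optional, so callers
// can override only the fields they care about.
function withDefaults(overrides: Partial<Options>): Options {
  // Spread the overrides on top of the defaults to produce a fully
  // formed Options object with all required properties present.
  return { ...defaults, ...overrides };
}

const opts = withDefaults({ retries: 5 });
// => { timeoutMs: 5000, retries: 5, verbose: false }
```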

Priyanka Raghavan 00:15:56 Okay, that's really neat to hear, and I think that's probably one of the quickest ways to learn the language, because otherwise you'd be talking to someone in the office or a buddy; anyway, that's now moot in Covid times. But in this context I have an anecdote. I've been using Copilot, obviously, just before interviewing you; I wanted to try it, so I've been using it for about a month. Mine is a little bit different. I've come back to Java after a really, really long time, like, say, 15 years, and I had this piece of code that I had to write because one of my buddies who was writing the Java code was actually not at work; he was on vacation. And the great thing was that Copilot actually helped me complete this task in about half a day. That was great.

Priyanka Raghavan 00:16:42 So I was done, which would've otherwise taken me some time, because, yeah, I've just been rusty. However, in the PR process, in the peer review comments, I got feedback that it was very sort of novice code and I could have used a better library, and I was wondering whether it was because Copilot was not looking at, say, my pom.xml and what version of Spring I was using, and things like that. So the question I was going to ask you was: is there a way to feed back to Copilot, hey, can you just improve your model? Can you look at these files? I mean, you did talk about going between the windows; maybe I didn't have my pom.xml open. What can one do?

Eddie Aftandilian 00:17:17 So this is good feedback for us. One of the things about the way Copilot works is that we mostly are looking at code and not configuration. So, we're not actually looking at your pom.xml even if you have it open. And another thing about the way Copilot works that we'd like to improve: imagine the underlying model here is trained on checked-in code in public repos on GitHub. So it's well formed, and if you're training to predict the next token, you've always got the imports at the top, and the imports are correct; otherwise that code wouldn't have been checked in. But when you're coding, your imports are not complete yet. So Copilot will assume that the imports you have in the file are the ones you actually want to use and then try to do its best to use those. But at least my experience is, often I actually want it to recommend a library for me, especially when I'm coding in an unfamiliar language and I don't know what the common libraries are; I would actually really like Copilot to suggest the standard library that people use to do this task. So that's an area of improvement for us.

Priyanka Raghavan 00:18:27 Okay, great. So you can actually start off with something and then build upon that; that might be a helpful starter. Yeah, I agree on that. One other question I wanted to ask you was also in terms of developer productivity, right? Let's get into a bit of that. There's this paper called "Productivity Assessment of Neural Code Completion"; I think you are one of the authors on that. The two points in that paper that really stuck out to me were: one, of course, the fact that Copilot seemed to perform better on untyped languages like JavaScript or Python; and second, that developers seemed to be more accepting of Copilot suggestions on weekends and late evenings. I found it very interesting, so can you break that down for us and comment on that?

Eddie Aftandilian 00:19:11 Yeah, yeah. We found that interesting as well. So, in terms of performance on different programming languages, we have seen that Copilot seems to perform better on JavaScript and Python than other languages. We're actually not entirely sure why; we have a number of hypotheses, but we haven't validated them. But you could imagine maybe for some reason it performs better on untyped or dynamically typed languages as opposed to statically typed ones. Maybe it's because they're very popular languages, and so there's more code in the training set to learn from for those languages. Or it could be some other reason that we haven't thought of. One sort of surprising thing about performance by language: we measure acceptance rate. Acceptance rate is one of our key metrics; that's the fraction of the suggestions that Copilot shows that the user accepts. We look at a breakdown by language, and sometimes we see that even less popular languages have a higher acceptance rate than the mean or the median, and we're not sure why. Someone asked about this a while back; they had assumed that Copilot wouldn't perform well on Haskell because there's probably not a lot of Haskell code in the training set.

Eddie Aftandilian 00:20:21 I went and looked, and actually Copilot performs better than average on Haskell, and we don't really know why, but sometimes the behavior of these large models is surprising. You mentioned the higher acceptance rate on weekends and evenings. So this is an effect that we've seen consistently. This is a pretty important effect that we have to be very aware of when we look at data. When we run A/B experiments, for example, we have to ensure that we have a full week of data before we make a decision on the outcome of the experiment, because otherwise you'll get skewed results based on overrepresentation of weekend or weekday. And in fact it's fairly subtle: you need to actually look at data in multiples of weeks, and then maybe there are seasonal effects that we haven't uncovered yet.

Eddie Aftandilian 00:21:13 So this is all very interesting from the perspective of how we make evidence-based decisions for improvements and so on. We're not totally sure why this effect happens. Again, we have ideas but haven't validated them. My personal hypothesis here is that on nights and weekends people are working on personal projects, and these are probably smaller and simpler and just fundamentally easier for Copilot to deal with. They're probably easier for the developer to deal with, too. But we don't know why this is happening. It does happen, and it consistently happens, and we have to take it into account when we do experiments.
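To make those two points concrete, here is a hedged sketch, with invented names, of how one might compute acceptance rate and restrict analysis to whole-week windows; it is not GitHub's actual pipeline.

```typescript
// Hypothetical sketch of the metrics discussion above; not GitHub's code.
interface SuggestionEvent {
  timestamp: Date;
  accepted: boolean; // true if the user hit tab to accept
}

// Acceptance rate: the fraction of shown suggestions the user accepts.
function acceptanceRate(events: SuggestionEvent[]): number {
  if (events.length === 0) return 0;
  const accepted = events.filter((e) => e.accepted).length;
  return accepted / events.length;
}

// Keep only whole weeks of data so weekends and weekdays are equally
// represented before reading out an A/B experiment.
function fullWeeksOnly(
  events: SuggestionEvent[],
  start: Date,
  now: Date
): SuggestionEvent[] {
  const msPerWeek = 7 * 24 * 60 * 60 * 1000;
  const weeks = Math.floor((now.getTime() - start.getTime()) / msPerWeek);
  const end = new Date(start.getTime() + weeks * msPerWeek);
  return events.filter((e) => e.timestamp >= start && e.timestamp < end);
}
```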

Priyanka Raghavan 00:21:53 Interesting. So, I wonder, when the data cannot tell you why something is happening, then what do you do? Do you do some behavioral study? I mean, this is just outside the software engineering context, but I'm just wondering.

Eddie Aftandilian 00:22:03 Yeah, well, often the data could tell us; we just haven't dug into the data yet to find out. Sometimes the data isn't sufficient to answer the question, and we'd have to go back and collect additional data, and then we also have to balance that with whether it's considerate of users' privacy and so on. So sometimes the trade-off is: is it worth answering this question versus collecting more information from the user?

Priyanka Raghavan 00:22:29 Okay, yeah, that makes sense. That makes a lot of sense. The next question I wanted to ask you is about the field of pair programming. Do you think that's going to go away because you now have this AI-powered friend that's going to help you?

Eddie Aftandilian 00:22:43 I don't think so. I think people will continue to pair program. I mean, we aspire to be an AI pair programmer, but a human is still a better pair programmer, and so I think people who like to pair program will continue to pair program.

Priyanka Raghavan 00:22:57 Yeah. In a similar context there's another question. A few days back we had this discussion in my company on improving code quality, apart from having the human in the loop. Oftentimes you're so pressed for time that when you're doing the peer review you might just approve something without really going into it, because if you're a senior member on the team you have so many PRs to look at, and you might just look at something very quickly. So I suggested that maybe it's time to have an AI-powered peer reviewer do a first round, and then of course the human comes into the loop. That was, of course, vehemently struck down. In fact, one person commented, and I was quite taken aback by the comment, that this is the downfall of the software development process. But I'd like to know your thoughts on that. What about the peer review process? Do you think that is something an automated AI-powered buddy could help with?

Eddie Aftandilian 00:23:50 I do think so. I hope it's not the downfall of our field. I think we're not there yet, right? But in code review, I think it's feasible in the future that you can have an AI bot that helps you review code. I mean, in some way, existing static analysis tools and linters are one form of this. They're not machine learning driven, typically; they rely on sort of hardcoded rules that are produced by an expert, but they are one way to provide automated feedback on PRs. That's one of the things I worked on at Google, and I always saw our tools as — I wanted them to be helpful to the users. I didn't want people to feel like they were annoyed by these things or that they had to check a box to merge their PR.

Eddie Aftandilian 00:24:38 I wanted them to actually be happy that the tool pointed out some problem that otherwise would've been a real bug in their code. And so, I think there's a pretty high bar for making code review comments and sort of auto-reviewing PRs, but it also seems like something that's pretty plausible in the not-too-distant future. You could probably train a model to predict code review comments. You could probably train a model to predict how to respond to code review comments. And so, I think this kind of thing is coming. I hope it works well.

Priyanka Raghavan 00:25:12 Right. Going back to the linters, I'll ask you a question. Linters look at a rule set, right? They have a kind of static rule set, and it would actually work well if Copilot suggested fixes based on those hardcoded rule sets, so it doesn't go to, say, the public repos, but looks at your own code to suggest fixes. Is that something that's also in the pipeline? And would that mean that maybe in the future we would probably not have linters, but this thing that could look at your existing code and suggest fixes?

Eddie Aftandilian 00:25:50 Yeah, so what you're proposing is: imagine you're getting comments on your PR. Could you imagine an assistant that suggests the fixes for you, and maybe you just click accept, or it just goes round and round on code review in the background while you sleep? Again, I think this is something that's feasible. There's literature in this area that I think is pretty convincing. Facebook has a tool called Getafix that they use: they take static analysis warnings that they see in their code base, and they mine their code reviews for how people generally address the static analysis warning. They mine a rule out of it, and then they ship that as an auto-fix, a suggestion that now comes along with this type of static analysis warning in the future, and the user can accept it without having to write the code on their own.

Eddie Aftandilian 00:26:41 Another bit of related work: at Google, I worked on a system to automatically repair code that didn't compile. So imagine you're working on your code base — this is in a compiled language, so you run the compiler, the compilation fails, and then you go add the semicolon or fix the type error or whatever it is, and then you rerun the build and it succeeds. So there we built a tool that used machine learning to figure out how to repair code that didn't compile, based on the particular compiler diagnostic we got. So, I think these are things that are feasible. I'd be interested in working on this type of thing again in the future.

Priyanka Raghavan 00:27:18 Did you say Getafix is the one from Facebook? I'll probably look it up and add it to the show notes so people can find it.

Eddie Aftandilian 00:27:23 That’s right, Getafix. It’s an internal tool at Facebook.

Priyanka Raghavan 00:27:28 Okay. So we could probably switch gears and go a little bit into some of the, I would call it, negative feedback or criticism that's out there about GitHub Copilot. So, the first thing I want to talk about: I am a cybersecurity architect, so I was obviously interested when I was looking at the ACM journals. I was looking at one of these papers, which I think was called "An Empirical Cybersecurity Evaluation of GitHub Copilot's Code Contributions," where it basically looked at about 89 scenarios for Copilot to produce code, and it produced, I think, quoting from the paper, 1,692 programs, and they said about 40% of the code that Copilot suggested was insecure. The reason, it said, is that Copilot was trained on public repos, and there was obviously insecure code there. So I wanted your comments on this as a new attack vector: maybe there'll be people creating malicious code in public Git repos, and, okay, Copilot's going to pick that up, and then people are going to start having insecure code. What are your thoughts on that, and how do you combat that?

Eddie Aftandilian 00:28:35 Yeah, sure. So this is something that's very important to us. In the paper, the authors created scenarios in which Copilot would have to write sort of security-sensitive code. And they acknowledge this in one of the threats to validity. So, it's important to note that it's not that 40% of all suggestions Copilot delivers are insecure; it's in these particular security-sensitive scenarios that this happens. And they acknowledge also that the reason Copilot suggests these things is that the humans who wrote the code Copilot was trained on also make these mistakes. I'm sure, as someone who works in cybersecurity, you've seen that even excellent developers make mistakes, right? So, in terms of the immediate things that we recommend, we recommend always running with a static analysis tool embedded in your workflow. Like I said, this is what I did at Google, and if your goal is to eliminate a class of security bug from your code base, it doesn't matter if it was written by Copilot or if it was written by a human; you need to have a checker somewhere catching these things and blocking people from merging code with these problems.

Eddie Aftandilian 00:29:52 In terms of what we can do here from the Copilot perspective: we aspire for Copilot to be better than a human programmer, and so we're investigating this at this point. You can come at it from two perspectives. One is you can analyze the output that Copilot produces and either redact — like, just don't show insecure completions — or highlight those in the IDE. You could have an integrated security scanner, or we could package with a pre-existing integrated security scanner that runs in the IDE. The other way you can come at this is by trying to improve the underlying model and push it toward generating more secure code. So, maybe you filter the training set for insecure examples. One of the sort of weird properties of these large language models of code is that they interpret comments, and sometimes silly comments can improve the code quality.

Eddie Aftandilian 00:30:50 So, we've found that things like just inserting a comment where you say "sanitize the inputs before constructing this SQL query" makes the model actually sanitize the inputs before constructing the SQL query, and that mitigates a potential SQL injection attack. So, there may also be things on the prompt construction side we can do to push the model toward generating more secure code in the first place. I also just wanted to mention, since I mentioned my background in static analysis: the researchers used a tool called CodeQL, a static analyzer, to detect the security vulnerabilities. A fun fact is that a lot of the team members who work on Copilot previously worked on CodeQL. So, security and static analysis is an important topic for a lot of the team members as well.
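As an illustration of that comment-steering effect, the contrast Eddie describes might look like the following. This is a hypothetical sketch using the Node pg client; it is not code from the paper or from Copilot.

```typescript
// Hypothetical sketch of comment-steered generation; not Copilot output.
import { Client } from "pg";

// Without guidance, a model trained on public code may imitate unsafe
// string concatenation, which is vulnerable to SQL injection:
//   const q = "SELECT * FROM users WHERE name = '" + name + "'";

// Sanitize the inputs before constructing this SQL query.
// A comment like the one above tends to nudge the model toward a
// parameterized query, letting the driver escape the value:
async function findUser(client: Client, name: string) {
  return client.query("SELECT * FROM users WHERE name = $1", [name]);
}
```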

Priyanka Raghavan 00:31:40 Okay, that's good to know. While you're talking about running your code through a SAST or CodeQL kind of checker, I also remember this other video that I saw on YouTube from one of your colleagues at GitHub Copilot, where he talked about how you check whether Copilot is producing good code, and in the video there's a part where it also runs a bunch of tests on the code. Is that something that'll be there in the future? So, as soon as Copilot generates some code, will it also produce the tests so that you can run them? Is that something that's also going to come together?

Eddie Aftandilian 00:32:17 There are a few things bundled here; I'm going to try to unbundle them. This video is by my teammate Albert Ziegler, and he is talking about how we evaluate the quality of, let's say, a potential new model that OpenAI has, or a potential improvement that we have to prompt construction, these kinds of things, right? We call this the harness. Our first step is to do an offline evaluation. I talked a little bit about A/B experiments; we do those, but that's later in the pipeline. So the first filter here is an offline experiment using the harness. And the way the harness works is we take public GitHub repos and we attempt to install their dependencies and run their tests, and then if the tests pass and they have good coverage of the functions in the repo, then we take a particular function that has good coverage, we delete its function body, and we ask Copilot to generate a replacement.

Eddie Aftandilian 00:33:16 Then we rerun the tests, and if the tests pass, we call it a pass. And if they don't, we call it a fail. And so this is kind of our first step in evaluating quality. It accounts for the fact that we don't need an exact match of what was there. We actually don't want an exact match of what was there, because that sort of implies that the model has memorized something. So we actually want a slightly different completion that has the same behavior on the tests. You asked whether Copilot might generate tests for you in some future version. That's a bit different from what we're doing here; this harness is about evaluating quality for our team. It's not something intended to be user-visible. I think generating tests is another place where Copilot could be helpful. It'll gamely try to help you; it'll try to write tests too. It's just another form of code. In my experience, it works okay if there are example tests: if you're in a file with example tests, it'll do a good job of duplicating what's there and adapting them to different test cases. You're still going to have to edit them. I also think that test cases are an interesting place where we could probably do something special and make it much better at writing tests than it currently is.
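A condensed sketch of that harness loop, with invented names and injected helpers standing in for the real infrastructure, might look like this:

```typescript
// Hypothetical sketch of the offline evaluation harness described above.
interface Candidate {
  repo: string;         // a public repo whose tests install and pass
  functionName: string; // a function with good test coverage
}

async function harnessPassRate(
  candidates: Candidate[],
  generateBody: (c: Candidate) => Promise<string>, // model fills in the body
  spliceBody: (c: Candidate, body: string) => Promise<void>,
  runTests: (repo: string) => Promise<boolean>
): Promise<number> {
  let passes = 0;
  for (const c of candidates) {
    // Delete the function body and ask the model for a replacement...
    const body = await generateBody(c);
    await spliceBody(c, body);
    // ...then rerun the repo's own tests: pass iff they still pass.
    if (await runTests(c.repo)) passes++;
  }
  return candidates.length === 0 ? 0 : passes / candidates.length;
}
```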

Priyanka Raghavan 00:34:27 Okay. The other thing I wanted to ask you, to get back onto the negative criticism, was about Copilot being a disruptor to the field of software development. This is something that I've heard from many quarters, right from literature online to informal chats with fellow friends, engineers, et cetera. Do you think that maybe it could be the end of entry-level software engineering jobs? I know it sounds pretty harsh, but I'm just curious.

Eddie Aftandilian 00:34:56 I don't think so. My hope is that tools like Copilot will lower the barrier to entry and enable more people to become software engineers. You asked, could this eliminate entry-level jobs? I think it's the opposite. I think it'll enable more people to be entry-level software engineers, and help those entry-level software engineers become more productive more quickly and write better code. If you look at the past in developer tools, we've seen that new developer tools help and augment; they don't substitute for developers. You might have imagined, back in the days when everyone was writing machine code or assembly, that compilers would mean fewer engineers or fewer developers. It's been the opposite. It's opened the field to more people and empowered more people to write code, and I think Copilot will do the same thing.

Priyanka Raghavan 00:35:47 Yeah, I like the anecdote about going from assembly to compiled code. I think it's the way you use the tools, and maybe a lot of the donkey work that we do would also be gone. Could be.

Eddie Aftandilian 00:36:03 Yeah, hopefully. Hopefully we can automate the boilerplate and let developers focus on the more interesting parts of the job.

Priyanka Raghavan 00:36:10 Right, yeah. Can you comment a little bit about the privacy angle on the public repos? Because there's also a lot of discussion about whether everything that is public becomes open source. And then there's also this term called code laundering; I think there's an IEEE paper which says that Stack Overflow could also contribute to code laundering, and that's one of the things people talk about with Copilot, because of the training on public repos. Does all of that become open source? Can you comment a little bit on that?

Eddie Aftandilian 00:36:41 Sure. So I guess first I want to be clear that we do not use private code to train the underlying model, and we don't suggest your private code to other users of GitHub Copilot. We train on public repos on GitHub. In addition, we've built a filter that detects and filters out rare instances where Copilot suggests code that matches public code on GitHub, and users have the choice to turn that on and off during setup. In terms of this idea of code laundering: we think that with Copilot and Codex, it's similar to what developers have always done. You use source code to learn and to understand, and we think it's critical that developers have access to tools like Copilot to empower them to create code more productively and efficiently.

Priyanka Raghavan 00:37:32 Okay. It's interesting, on the setup; can you just explain that again? So when you actually create a public repo, you have an ability to say whether you want to contribute to Copilot or not? Is that what you're saying, whether your repo can be used?

Eddie Aftandilian 00:37:44 No, no, no. The filter is for users of Copilot.

Priyanka Raghavan 00:37:47 Ah, okay.

Eddie Aftandilian 00:37:48 So like I said, we built a system to detect when Copilot is producing a suggestion that matches public code somewhere on GitHub. And if you enable that option then Copilot will just not suggest things that are copies of code elsewhere on GitHub.
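The shape of that filter, as Eddie describes it, might be sketched like this (a hypothetical illustration; the real matching system is not public):

```typescript
// Hypothetical sketch of the "matches public code" filter described above.
function filterSuggestion(
  suggestion: string,
  matchesPublicCode: (code: string) => boolean, // stand-in for the detector
  blockMatches: boolean // the user's setting chosen during setup
): string | null {
  if (blockMatches && matchesPublicCode(suggestion)) {
    return null; // suppress suggestions that duplicate public GitHub code
  }
  return suggestion;
}
```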

Priyanka Raghavan 00:38:07 But maybe it also makes sense, it's just like one of those requirements sessions, that when you set up a GitHub repo you could also say, hey, my repo shouldn't be suggested by Copilot, shouldn't be used in the experiment. Is that something that's possible? I'm curious.

Eddie Aftandilian 00:38:23 I can’t comment on that.

Priyanka Raghavan 00:38:25 Okay. But yeah, that's maybe something that we could ask on the GitHub issues. Okay, that's great, Eddie. I think let's go on to the last part of the show, where I want to ask you a few questions on the future of Copilot. The first thing I wanted to ask: Copilot of course requires us to be online to actually get it to work. So is there something being done to make it work in offline mode?

Eddie Aftandilian 00:38:48 So, I think that's an interesting direction. As I mentioned before, the models that power Copilot are very large and very resource-intensive, and so it's not feasible to run them on really any personal machine that a person would have. We don't have plans in this area.

Priyanka Raghavan 00:39:07 Okay. Unless you have, what do you say, many GPUs on your laptop, and then, yeah.

Eddie Aftandilian 00:39:14 Yeah, you would need industrial-grade GPUs; even your gaming GPUs are not sufficient.

Priyanka Raghavan 00:39:24 Okay, good enough.

Eddie Aftandilian 00:39:25 Can I ask you a question here? How often do you code without access to the internet?

Priyanka Raghavan 00:39:28 You caught me there: probably never. Yeah, it's been a while.

Eddie Aftandilian 00:39:34 It would be hard, right? Yeah. You are always looking stuff up, looking up documentation, going to Stack Overflow and so on.

Priyanka Raghavan 00:39:40 That is true. Something that struck me was, of course, I think I'd be lost without the internet; a bad confession to make on Software Engineering Radio. For languages I'm very comfortable with, like Python and C# right now, I could do stuff, but for something new, even there I would always be searching for stuff online. So yeah, it's true. Since we are doing natural language processing, I wanted to know: is there scope for voice-activated coding in the future? Like my just saying, "Hey, please write me a binary search tree" in my IDE. Is that also a direction?

Eddie Aftandilian 00:40:19 Yeah, I think that's an interesting direction, and I think the critical bit there is what the interaction looks like. Well, if you start thinking about this, imagine you want to dictate code; that would be really hard. You would be dictating punctuation, saying "semicolon"; it would be very awkward. So being able to do this at a higher level, I think, would be really helpful to people. It would be interesting to explore that.

Priyanka Raghaven 00:40:44 Okay. Is that something that researchers are looking at or no?

Eddie Aftandilian 00:40:48 I'm sure some researcher somewhere is looking at that.

Priyanka Raghavan 00:40:53 The other question I wanted to ask is interesting. There are certain languages, for example, say, COBOL and the mainframe technologies, which some companies still have things running on, but there's really a dearth of developers in that field. So companies really struggle to find people who know those languages. Could these Codex models be trained on those languages, and maybe companies pay for that to run on their mainframe machines? Is that also something that GitHub is looking at?

Eddie Aftandilian 00:41:24 We're exploring offering a version of Copilot that's been adapted to an enterprise's private code base or set of private code bases. I hadn't really considered this from the COBOL or legacy programming language angle. But it seems possible that such an adapted version would work well for those kinds of legacy languages that it hasn't actually previously seen much public code for. Our goal in all of this is to assist developers and make them more productive. And so I think it's kind of similar to your earlier question about helping programmers learn new languages. You can imagine this being helpful for a non-COBOL programmer to be able to make changes to an existing COBOL code base.

Priyanka Raghavan 00:42:10 Okay. So an enterprise edition would then kind of help? Yeah.

Eddie Aftandilian 00:42:13 Yeah, I think so.

Priyanka Raghavan 00:42:14 Okay. I think that's all I have, Eddie. And finally, before I let you go, I have to ask you: where can people reach you in case they want to contact you to learn more about Copilot?

Eddie Aftandilian 00:42:25 Sure, so I have a Twitter account. It's eaftandilian: E and then my last name, all one word. My GitHub handle is @eaftan.

Priyanka Raghavan 00:42:38 I'll definitely put that in the show notes. So thank you for coming on the show. It's been quite enlightening for me, and I hope the listeners enjoy it.

Eddie Aftandilian 00:42:46 Thank you very much. This was fun.

Priyanka Raghavan 00:42:48 Thank you. This is Priyanka Raghavan for Software Engineering Radio. Thanks for listening. [End of Audio]


SE Radio theme: “Broken Reality” by Kevin MacLeod (incompetech.com — Licensed under Creative Commons: By Attribution 3.0)
