Episode 517: Jordan Adler on Code Generators

Filed in Episodes on June 21, 2022

In this episode, SE Radio host Felienne spoke with Jordan Adler about code generation, a technique to generate code from specifications like UML or from other programming languages such as TypeScript. They also discuss code transformation, which can be used to migrate code (for example, from Python 2 to Python 3) or to improve its internal structure so that it conforms better to style guidelines. Adler is currently the Engineering Director for the Developer Engineering team at OneSignal, and he was previously lead API Platform Engineer at Pinterest and a Developer Advocate at Google.

Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact content@computer.org and include the episode number and URL.

Felienne 00:00:16 Hello everyone. This is Felienne for Software Engineering Radio. Today with me on the show is Jordan Adler. He has been a professional software developer since 2003. He’s currently Engineering Director for developer engineering at OneSignal. Previously, he was API platform engineer at Pinterest and developer advocate at Google. Welcome to the show, Jordan. Today’s topic is code generation. So, let’s start with a definition: What, for you, is code generation?

Jordan Adler 00:00:46 That is a technique for producing code output rather than some kind of expected user behavior. So for example, a common code generation technique would be a transpiler, which, unlike a compiler, which turns programming code into machine code, translates programming code from one language to another. So one of these would take TypeScript and write JavaScript. That would be an example of,

Felienne 00:01:33 Yeah, that’s an interesting example, because that leads to the question: why are we generating source code? Why are we not just typing source code? So what is the benefit of generating JavaScript from TypeScript, or in other contexts generating certain pieces of software, if we can also type that? I get it for assembler; no one wants to type byte code or assembler. But with TypeScript, it’s fine. Why are we generating this?

Jordan Adler 00:02:00 Yeah, there are lots of different reasons to do that. Typically the answer is productivity of one kind or another, right? So if you are trying to write a piece of software and there’s a lot of duplicate code in that piece of software, perhaps it’s duplicated because you’re one of five different teams, each trying to build a system, and they all interact with each other and maybe they use different languages, but they all have the same kind of interface, the same specified method of interacting with each other. You might want to procedurally generate that interface code so that when you actually change the way that the servers communicate with each other, you only have to change it in one place instead of five places. So that’s a common reason. Another common reason could be, like I mentioned with the JavaScript example, that perhaps you’re conducting some checks and in the process producing code that is consumable by some other tool.

Jordan Adler 00:02:54 Another example might be that lots of folks have Kubernetes YAML, right? That becomes unwieldy and repetitive after a while. And so there are tools out there that can actually produce Kubernetes YAML for you based off of. And so that process effectively generates code, declarative code that Kubernetes consumes. And so there are lots of different reasons people might want to do this, but typically they boil down to productivity. You have some kind of system, either a computer system or people, that expects code to come in one way, and this can enable you to fit that standard; it’s a technique you can use to fit that requirement, reducing the cost of actually

Felienne 00:03:38 Yes, generally it’s quicker. And it might also be less error prone because you can do some checking before you actually generate the code. So you’re generating correct code in for a

Jordan Adler 00:03:49 Definitely for correctness. For duplicate code, you can produce multiple different versions from the same input, right? So the process of doing that, as opposed to having someone write it out, is a lot quicker and less error-prone. Absolutely.

Felienne 00:04:04 Yeah. That makes sense. So you already sort of hinted at some concrete examples, but can you give a specific example of a situation in which you used a code-generating tool to solve a specific problem?

Jordan Adler 00:04:17 Yeah. So one example would be that we have a tool that’s a code application to add an SDK into a mobile app. So you have a code base; it’s an Android or an iOS app, for example. You can run this tool, and it’ll scan the programming code for that application and make the right changes to the code to be able to include the SDK. So this is a technique called code transformation, where you take in one piece of code and output another piece of code, but you’ve modified the code in some way. The difference here is that we’re not converting from one language to another; we’re keeping it in the same language. Maybe we’re semantically changing the behavior of the application.

Felienne 00:05:15 Yeah. So we’re enriching an existing code base with some features. And later in the episode, we want to dive into code transformation specifically as a separate process from code generation. I’m also wondering, are there anti-patterns? Are there situations in which you would say that code generation might not be the right solution?

Jordan Adler 00:05:38 Yeah. I mean, oftentimes it adds quite a bit of complexity, particularly in your build tool. So if you have a situation where you think you might be able to save developer time by generating some piece of the code base before building and producing it, that adds onto your build process. So that can add time to each build, and also during development: right in the mix, during that tight developer loop, it’ll end up taking longer. And so oftentimes the tradeoff here is, yes, I’m spending a lot less time writing code, but I’m spending a lot more time waiting for code to be generated. That is a tradeoff that you have to make intentionally. And the productivity gains will have to outweigh the cost of establishing the pattern, which is complicated.

Felienne 00:06:52 We want to talk about the whole build process of code generation deeper in the episode as well. But one question that maybe still sounds a little bit abstract for people who have never used code generation tools is: what does a code generation tool look like? Do I write code to generate code? Or is this a visual tool where I sort of connect the interfaces together, and then it generates code from a visual model? What does code generation look like, practically?

Jordan Adler 00:07:23 That’s a great question. I think in practice, all of those are tools that you can use. There are one-off visual tools, for example, to build out, say, a SQL specification: instead of writing statements to create tables, there are lots of table-designing tools out there that produce statements consumed by the database. That is one case, certainly. Another common one, the most common again: if you have something like Swagger, which is a specification, you can have in YAML or JSON a definition of an API and run a CLI tool that procedurally generates from that specification client libraries, or perhaps servers, or pieces of code that are then consumed by a Java application that fills out stubs of that interface, right? So it can vary in terms of interface. It can be CLI based, it can be UI based, it can be something you use once as part of your development process and never use again. It could be something that you use every single time you build, or something you use manually when you pull something from upstream. It’s a technique that can be used in many different ways, for sure.

Felienne 00:08:48 Nice. So that gives us a lot of ways to apply code generation in projects. Now the code has been generated with one of the variety of tools that you just described. So now what? Do I manually read this code? Is there some sort of verification, or do I verify the generation? What do you do in that case? Do you ever look at the generated code? Is it ever necessary to inspect that, or is it sort of correct by construction?

Jordan Adler 00:09:17 Oh, absolutely. You can establish a pattern by which you can generate code and have that tested in a way that enables you to build confidence. For example, when I was at Pinterest, as we were converting bits and pieces of code from Python 2 to Python 3, we could take a piece, convert a small chunk of it, and deploy it to a portion of our overall fleet, let’s say 2%. And then if 2% of our fleet is running this new version with these new modifications, and it’s serving all the same requests and returning all the same outputs and not having any new errors, not producing any new issues, we can probably say that it’s safely consistent between the two versions. So in cases where you have a deploy process where you canary like that, or have some other process for statistically eliminating risk, and you can move forward carefully, then automating the process of deploying generated code is not unreasonable.

Felienne 00:10:35 Yeah. And so I wanted to say, this is a situation in which you already have running code; you have a baseline of what it’s supposed to do, and you can migrate parts of it. But this is of course not always the case. So I was wondering if you also have examples or experience with freshly generating code, where you do not have a baseline to test against.

Jordan Adler 00:10:55 Oh, absolutely. And in most cases you really should manually review the code. So even when we were working at Pinterest on this project to go from Python 2 to Python 3, we were routinely manually inspecting the changes that were coming through. And honestly, some of the code transformations we had were not error-prone at all, right? They were fairly straightforward: convert this print, add parentheses, so it’s no longer a statement; now it’s a function. That’s a pretty straightforward thing to change until you start throwing in complexities. Like, well, what if we have our own function called print that we’ve defined? Or if we have some kind of special label in our code that we’ve modified in some way? Or what if we have function calls that look like print, and perhaps the regexes that we use to convert the code, or whatever technique we use, are overzealous?

Jordan Adler 00:11:57 So we’d review that part. For example, at OneSignal we have the API clients that I mentioned, which we procedurally generate from specification files. We pull in changes from our generator’s source repository manually, we rerun the code generation, and then we review the changes that occur before landing, because we can’t know for certain what the changes are. So that is more review-process based, or even PR inspection, which is much more scrolling through thousands and thousands of changes and looking for outliers, as opposed to really deeply inspecting every single line that’s changed and trying to understand it.

Felienne 00:13:04 Yeah, that makes sense. And I guess there’s also a difference between being the person who is authoring the code generation tooling and simply using something that has been extensively tested. In the latter case you can probably rely a little bit more on the fact that the generation will be correct, because it has already been tested by many others.

Jordan Adler 00:13:23 That’s a really good point, and I think you’ve hit on something interesting about code generation, which is that it often involves collaboration between people. It’s a technique that is pulled out when two teams, or two groups, or two pieces of software have to interact with each other; two or more, really. And so having that consideration of, okay, where is this code coming from? Who wrote the code generator? Understanding that is as much a part of understanding how to integrate and deploy this technique in your code base as anything else.

Felienne 00:13:56 So let’s talk about practicalities. You already mentioned that this code generation will then be part of your build process, which might be time consuming, but you also get some interesting questions. Like, what do I do with the generated source code? Do I check this into version control, or is this typically something that you would put in a .gitignore? Because, well, if you need it, you can just generate it again. I can imagine that for reasons of traceability, maybe you also want to ship the generated code, so you are sure that everyone looks at the same version of it. What are your best practices there?

Jordan Adler 00:14:30 Yeah, I think I don’t, there are ES comes code really compilation and, and kind of the consideration of, of kind of managing code. There are lots of different ways to treat code as data and lots of different patterns for using that. I have seen cases where people have generated code, for example in Java, and then modified the exact same file to fill out the stub. Then, on updates to the API, you can procedurally generate the changes to the server function, get a patch file, run that against your file, and then manually edit it. So that can work. You can have both kinds of code in the same files if you’re going to be manually reviewing. If you’re going to be automating it, I probably would not have them in the same files.

Jordan Adler 00:15:39 Also, whether or not you check it in depends on whether your generated code is more of an intermediary object or more of a desired output of some kind. So, for example, for the client libraries, the generated code is the product, right? And so for us, having that checked into version control actually makes sense, though not in the repository that contains all the code that generates it. So we have one repo that holds all the code that generates the client libraries, and then other repos for the client libraries themselves, such as Java.

Jordan Adler 00:16:19 So the reality is that you need to use whatever approach makes sense. My only cautionary statement here, and the good rule of thumb, is that when you’re working with a language that’s typed, you should take advantage of that typing. If you’re using code generation in a way that basically creates an intermediary layer between the procedurally generated types and the types that you’re actually using in your handwritten code, in other words, if your handwritten code and generated code have two totally different type graphs and they’re not connected at all, then your type checker’s not really doing its job. And that’s a problem. So you do have to be conscious of that. But other than that, I would say there’s no hard and fast rule, and it really depends on the situation.

Felienne 00:17:13 Yeah. I think I can add an example there from a project that I work on myself, because sometimes it’s also about what tooling you expect people to have. So we have a backend that’s in Python, and most of our open source developers actually work on the Python side. And then we have a little front end that’s written in TypeScript that we then transpile to JavaScript. We do check in the generated JavaScript, just because we think that it’s a hassle for the Python developers to have to generate the JavaScript themselves. They might not have npm. They might just not be ready for that type of tooling. So it’s like a courtesy to people: here’s the generated code. If you’re not changing anything in the front end, you don’t need to compile the code. So sometimes it’s also about whether you require the users or the contributors in your project to install all the code generation tooling, which might sometimes be complex to deal with. So that’s maybe also a consideration: not only who needs to generate the code, but also who will feel like installing all the tools that make the code generation happen.

Jordan Adler 00:18:15 That’s a really interesting point. And interestingly enough, it’s illustrative of the difference between commercial applications of this technique and open source or academia, where you want volunteers; you want people to join. And so you want to minimize the cost, the threshold of effort, to contribute code. And that’s not necessarily true in a commercial setting where I’ve most work environment where I, well

Felienne 00:18:45 Too tough, yes, you just have to do what I say. Yes, exactly.

Jordan Adler 00:18:47 Install this thing. Or I added it to the device management, so you don’t even realize it, but you already have a Java compiler. So

Felienne 00:18:56 Yeah, because sometimes this can really be a big block. I was looking into another code generation tool, and then it’s like, yeah, you have to install Eclipse and this version of Java. I never use Java. And for open-source work, that is a threshold: well, if it requires me to install Java, then I don’t feel like doing this. Maybe it’s not worth it. So that tooling angle, and you’re very right to point this out, is very different in open-source projects, where indeed we want to make it as easy for you as possible. We don’t want to force Python developers to install tooling where they’re like, what is this? Why do I need that?

Jordan Adler 00:19:33 Yeah, that’s a great point. There are lots of open-source toolkits out there for building code generation tooling. One of them is called yellow code, which is written in JavaScript. And that one is one that we’re using for a lot of our web work, so on the web, specific to React or Angular. And so we’re able to procedurally generate higher-level SDKs for these frameworks on top of our web SDK. We didn’t do that. The same Java tool we for code really exist for building these things. I have to imagine to some extent exist in part because of what you were saying, right? Like there’s, a lot of these things existed beforehand, but none of them kind of the same.

Felienne 00:20:28 Tool, the

Jordan Adler 00:20:29 Consistent tool.

Felienne 00:20:33 Yeah. We will definitely add a link in the show notes to the jelly code tool. Then I was also wondering what about documentation? Right? So if I’m generating code, where does my documentation live? Do I generate documentation that’s in the generated code for when people inspect the generated code? Or is that documentation typically placed wherever I’m writing the specifications for the generation, whether that is in a different programming language or in a visual tool, or is this something that lives in a markdown file where it just says, this is how you generate the code and this is what happens. Are there any best practices there?

Jordan Adler 00:21:10 Yeah. I mean, I think that the best practice when it comes to documentation is: yes, all of them. I think it’ll depend. So to give you an example, we’ll often procedurally generate, like I said, API client libraries, right? And that includes our API reference in it. So we have Python classes that are stubbed out that include docstrings, or documentation inline, as Python developers expect them. And that comes from our YAML file, the OpenAPI specification file, that says, okay, if you call a PUT on this path on our server, that is actually this function, and here’s what it does, and here are the parameters, and so on. And so that YAML file is consumed by the generator, which actually creates the client libraries. And so we have one place where we update that API documentation and then propagate that downstream to 10 different client libraries very easily.
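
The flow described here, one specification feeding both the client code and its inline documentation, can be sketched in a few lines of Python. This is a minimal illustration only, not OneSignal's generator: the `SPEC` dictionary stands in for an OpenAPI YAML file, and the operation names, paths, and the `generate_client` helper are all hypothetical.

```python
# Hypothetical stand-in for an API specification (not a real OpenAPI document).
SPEC = {
    "update_user": {
        "method": "PUT",
        "path": "/users/{user_id}",
        "params": ["user_id"],
        "doc": "Update one user.",
    },
    "list_users": {
        "method": "GET",
        "path": "/users",
        "params": [],
        "doc": "List all users.",
    },
}


def generate_client(spec, class_name="ApiClient"):
    """Return Python source for a client class with one method per operation."""
    lines = [
        "class " + class_name + ":",
        "    def __init__(self, send):",
        "        self._send = send  # callable(method, path) supplied by the caller",
        "",
    ]
    for name, op in spec.items():
        args = "".join(", " + p for p in op["params"])
        lines.append("    def " + name + "(self" + args + "):")
        # The docstring comes from the same spec entry as the code.
        lines.append('        """' + op["doc"] + '"""')
        lines.append(
            '        return self._send("' + op["method"] + '", f"' + op["path"] + '")'
        )
        lines.append("")
    return "\n".join(lines)


generated_source = generate_client(SPEC)
```

Changing a description or a path in `SPEC` and rerunning `generate_client` updates the method and its docstring together, which is the "update in one place, propagate downstream" property discussed above.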

Jordan Adler 00:22:10 So that’s one place where documentation lives; that’s documentation in the result. We can also procedurally generate just an API reference itself, right? Think of it as Markdown: instead of producing a client library as the output of the generator, the source project includes what’s needed to procedurally generate Markdown documentation, or other kinds, that we can actually host. And that’s in the generator project itself. That’s one piece, but our own repo, where we host all the code that actually executes as part of our tool chain, includes all of our patches to the downstream libraries. That repository also includes instructions for people who are working on our client libraries on how to specifically use it for us, which includes, by the way, how to patch the bribe for the result in client libraries to have kind of manually crafted procedure libraries from the templates are not always. So there’s documentation inserted into the code that’s being generated, documentation produced as an additional target that we can serve alongside our client libraries, as well as the documentation that exists for the developers who are working on our system and not the ones that are consuming the code by

Felienne 00:23:48 System. Yes. Yeah. So indeed there are these different forms of documentation. It’s probably a good idea to have it everywhere. And if you have a specification about what you’re going to generate, you might as well generate that specification. Let’s go from code generation more towards code transformation. We have already talked about this a little bit, but what exactly is code transformation? Now we have a process in which the input is code and the output is also code, but then there’s also code defining the transformation. So what does code transformation look like for you?

Jordan Adler 00:24:25 So think about code generation and code transformation as both being things that output code, right? Compilation also outputs code. So compilation takes in programming code and outputs programming code, maybe in a different language. Code generation takes in something semantic and outputs code, right? The input doesn’t have to be code. It can be some kind of configuration object or something like that. Code transformation, however, takes in code and outputs more or less the exact same code, but having been modified in some way. And so code transformers, sometimes called code modifiers, can take a variety of different shapes in terms of how they’re implemented, but really what they try to do is produce something that’s basically the same language, but with some modification in the code itself, for example semantically, in the case of a code transformer that’s trying to change the behavior of a function, and maybe you have to change everywhere it’s called as a result, right? If you have a very large code base, you might not want to do that manually. You might write a little code to update everywhere it’s called to change the parameters that are being passed around. So that is one consideration for how code transformation is different from other techniques in the space.

Felienne 00:25:48 Yeah. So your example made me think of a refactoring, right? Adding a parameter or changing the order of parameters is something I can do in the IDE. I right-click a function, and then I can reorder the parameters. So that is a refactoring, but is it also a code transformation? Is refactoring an example of a code transformation, or is it not, because it’s not really done with a code generation tool?

Jordan Adler 00:26:14 I think refactoring is a common goal, or a common use, of code transformation. A tool that does that is a code transformer, right.

Felienne 00:26:34 So we’ve identified one tool to do code transformation with, the IDE, but I guess there are also other tools in which we write code to script the transformation or to visually manipulate the transformation. What are tools that you typically use for code transformation?

Jordan Adler 00:26:52 That’s right. So if you take code and you’re tools you use code before yellow code is kind of a toolkit for parsing, so it’s a toolkit for making code transformers. And so it has elements of it that enable you to parse languages and represent programming code in a given language, say TypeScript, as a data object of some kind. And really, if you think about, okay, what is a code generator? What is a code transformer of some kind? Well, it’s really a two-step process, right? Step one, get code into data. Or I guess it’s three steps if you’re transforming it: step two, munge that data somehow. And step three would be outputting that data back as code again. And there are lots of different ways that you can do that, and lots of different tools. You can roll your own, certainly. Or you can use compiler tool chains that often have that first step and last step, which are code to data and data to code.
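
The three steps described here (code into data, modify the data, data back into code) can be sketched with Python's standard `ast` module. This is a minimal illustration, assuming Python 3.9 or later for `ast.unparse`; the function names `fetch` and `fetch_user` are made up for the example.

```python
import ast

source = "result = fetch(user_id)\n"

# Step 1: code -> data. Parse the source text into an abstract syntax tree.
tree = ast.parse(source)


class RenameFetch(ast.NodeTransformer):
    """Rename every reference to the (hypothetical) name `fetch`."""

    def visit_Name(self, node):
        if node.id == "fetch":
            node.id = "fetch_user"
        return node


# Step 2: modify the data. Walk the tree and rewrite matching nodes in place.
tree = RenameFetch().visit(tree)

# Step 3: data -> code. Turn the modified tree back into source text.
new_source = ast.unparse(tree)
```

Note that `ast.unparse` regenerates the text from the tree, so comments and the original formatting are not preserved; tools that must keep formatting usually work on a richer tree than this.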

Felienne 00:27:59 And then what you are manipulating in between is the data representation, which will often be a parse tree, I guess.

Jordan Adler 00:28:07 So it can be a parse tree. So now we’re getting deeper into parsing and have classes. You might some of these, these things, but you can use an abstract syntax tree, which includes enough of the program source to turn representations of program source back into source code. Once you’ve stripped out white space and so on, you can’t immediately turn. So a lot of compilers will multiple trim down, they’ll transform that or Python’s virtual machine. But in our case, we’re going to go part of the way. So for Python, as an example, we can actually use Python’s ast module, the thing that Python itself uses to represent Python programs, and code from its that we kind class, then we can modify it as we like. But there are other ways too. For example, you don’t have to use a compiler tool chain. You can just use, or even kind of look for strings and manipulate strings; really any way that you can manage text as strings you can use for code too.

Jordan Adler 00:29:33 But the less context aware your implementation is, the more risky it is in terms of the error-proneness of the output. Because you have to imagine, if you’re going to run this code transformer on multiple different code bases, even if you test on a million lines of code, there are particulars that your transformer just doesn’t know about, and they won’t be encountered until someone else picks it up and uses it. And so you have to think about that as you’re designing your transformer. But certainly, the simplest possible implementation could be a script that is basically a one-liner call to find and replace with sed or something like that.
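
To make the risk of a context-unaware transformer concrete, here is a small sketch of the "one-liner find and replace" approach written with Python's `re` module. The pattern and the sample lines are invented for illustration; they are not taken from any real migration tool.

```python
import re


def naive_print_fix(line):
    """Wrap whatever follows `print ` in parentheses, with no notion of context."""
    return re.sub(r"print (.+)", r"print(\1)", line)


# Works on the case it was written for: a Python 2 print statement.
fixed_statement = naive_print_fix('print "hello"')

# Overzealous on a line that merely contains the text `print ` inside a string.
broken_string = naive_print_fix('label = "do not print this"')
```

The second call rewrites the inside of a string literal and leaves the line syntactically invalid, which is the kind of outlier that a tree-based transformer, or a human reviewing the diff, would be expected to catch.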

Felienne 00:30:22 Yeah. And of course it can be easy, but also more error-prone. If you are transforming Python 2 to Python 3, and you just want to add brackets around every print, you could do that with a little bit of string magic, but then maybe you’re not really sure that every print you encounter is actually the print that you want to transform. So let’s talk a little more about this case study, because you have worked on this Python 2 to Python 3 transformation project, and I would love to hear more about it. Did you do everything automatically, or what are some edge cases that had to be transformed manually? And what was your approach? Can you just take us through that project and how you approached it?

Jordan Adler 00:31:00 Absolutely. And so I talked this project

Felienne 00:31:08 Link

Jordan Adler 00:31:14 A tool called Python-Future, which is produced by an outfit called Python Charmers, has a number of these things, three kinds of systems. The first thing is a set of code transformers, code modifiers, that take Python 2 code and convert it into Python 2 code, but in a way that is more aligned with Python 3, or gradually, incrementally more consumable there. Python 2 actually included a module called __future__ (underscore underscore future underscore underscore). So it includes a directive in Python code that says: I’m going to run this under Python 2, but I want it to behave like Python 3 for this specific kind of change. And so what we did at Pinterest was we went through these code transformers and left our system running on Python 2, but incrementally made it more able to run on Python 3.

Jordan Adler 00:32:50 And it starts with this code and these directives to the Python 2 virtual machine that say: behave more like Python 3 in this way. Right? So you’re incrementally including backwards-breaking changes from a later version. It’s kind of hard to explain, but you have to imagine for a moment that you’re essentially choosing to gradually cause that breaking to occur. A lot of that was added by the way, Python, which kind of out Python three. So this added the Python migration really started years before Pinterest Pinterest companies in part size of the code, this. So it starts with the code transformers. You incrementally make the code more able to run on Python 3. The Python-Future project includes some what’s called sores it’s import function that creates string objects that are more like Python 3 than Python 2. Once you’ve produced Python 2 code that behaves more like Python 3 and is running on Python 2, then you can start bringing in these future functions or future classes, which are basically runtime shims that model the behavior of Python 3 under Python 2. So you can start coding against Python 3 in your code by pulling in from

Felienne 00:34:48 So you can migrate while you are also adding new features to this existing code base. That’s what you’re saying, right?

Jordan Adler 00:34:55 That’s right. Yeah. You can migrate while using features that would typically not be in Python 2, specifically the things that change in Python 3. You bring in more of those changes either through directives to the Python virtual machine or through these effectively user-space implementations of core Python objects that are consistent between Python 2 and Python 3. This is in contrast, by the way, to another approach that you can use to do the Python 2 to Python 3 migration, which is basically if statements. You can say: if Python 2, do this; if Python 3, do that, right? And that pushes the complexity into our code base, as opposed to this module we’re
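
The "if statements" approach mentioned here can be sketched as follows. This is a generic illustration of version branching, not code from the Pinterest migration; the helper name `to_text` is hypothetical.

```python
import sys

# Branch on the interpreter version at import time.
if sys.version_info[0] >= 3:
    text_type = str
else:
    text_type = unicode  # noqa: F821 (this name only exists on Python 2)


def to_text(value, encoding="utf-8"):
    """Return `value` as a text string on either major version."""
    if isinstance(value, bytes) and not isinstance(value, text_type):
        return value.decode(encoding)
    return text_type(value)
```

Every such branch stays in the code base until Python 2 support is dropped, which is the complexity cost being contrasted with shims that live in a separate module.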

Felienne 00:35:44 Yeah, because if you have the complexity in the code transformation tool at one point, hopefully you are done. So then you no longer need that complexity. And then you end up with a cleaner code base that is 100%

Jordan Adler 00:35:56 That’s right. So at the end of this project, in the final stage, you can take that code, run it under Python 2 side by side with Python 3, confirm that they behave the same, and then actually stop running under Python 2 and remove all those directives. The cleanup patch is a lot smaller, right? It’s just removing a few lines from the top of each file that,
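
The cleanup step described here, removing the directives from the top of each file once Python 2 is gone, is itself a small code transformation. The sketch below assumes the directives are `from __future__ import ...` lines and uses Python's `ast` module (Python 3.9 or later for `ast.unparse`); the helper name `strip_future_imports` is hypothetical.

```python
import ast


def strip_future_imports(source):
    """Return `source` with top-level `from __future__ import ...` lines removed."""
    tree = ast.parse(source)
    tree.body = [
        node
        for node in tree.body
        if not (isinstance(node, ast.ImportFrom) and node.module == "__future__")
    ]
    return ast.unparse(tree)


before = "from __future__ import print_function, division\nimport os\nprint(1 / 2)\n"
after = strip_future_imports(before)
```

Because `ast.unparse` regenerates the file from the tree, comments and formatting are lost; a real cleanup across a large code base would more likely delete the matching lines directly or use a formatting-preserving tool.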

Felienne 00:36:34 Yeah. So let’s talk about tools for this project. What did you use to write the transformations in, or to define the transformation? Was that the code tool you mentioned, the JavaScript tool, or did you use something else?

Jordan Adler 00:36:48 That tool is JavaScript-based, so it’s not what we used here. It also, I think, came out a little bit later. Python has this in the Python standard library, so this is actually Python itself being used to transform Python. Basically, we take in code, we read it in, and we use the ast module. So it’s reading code and turning it into an AST object, which is an abstract syntax tree. And then we transform it. We look for specific nodes. For example, maybe we look for a node that is of the function call type. With that function call type, you want to find out what function it’s calling, and you can pass in a name and say print, right? So you can write a little piece of code that says: hey, once you’ve got the abstract syntax tree, look for a node that is a function call of print; if it’s there, we change it, but if we never find it, then we don’t do anything.
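The steps described above (parse, find function call nodes, check the callee, change it or leave the code alone) can be sketched with the standard library `ast` module. The `PrintToLog` class, the `rewrite` function, and the target name `log` are assumptions for illustration, not the transformer used at Pinterest, and `ast.unparse` requires Python 3.9 or later.

```python
import ast


class PrintToLog(ast.NodeTransformer):
    """Rewrite calls to print(...) into calls to a hypothetical log(...)."""

    def visit_Call(self, node):
        self.generic_visit(node)  # handle calls nested inside the arguments
        if isinstance(node.func, ast.Name) and node.func.id == "print":
            new_func = ast.Name(id="log", ctx=ast.Load())
            node.func = ast.copy_location(new_func, node.func)
        return node  # calls to anything else are returned unchanged


def rewrite(source):
    """Parse source, apply the transformer, and turn the tree back into code."""
    tree = PrintToLog().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


print(rewrite("print('hello')\ntotal = add(1, 2)\n"))
```

Note that `ast` does not keep comments or original formatting, so a round trip like this is only suitable where losing them is acceptable.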

Felienne 00:37:49 So this is tooling then that sort of depends on a certain programming language. Does this exist for any programming language? Can you transform Java with a similar approach or is this a very Python thing to have?

Jordan Adler 00:38:04 This is definitely right. Most compiled languages don’t have some version of this. Most, or maybe “most” is too strong, I’m not sure, but many interpreted languages do. So Python, Perl, and PHP probably have some version of an abstract syntax tree class, or some way to model Python code or Perl code or PHP code, for example, in the language itself. But most of the time you won’t see that. And in fact, for compiled languages, you may have to reach for a compiler toolchain to get there. So, for example, LLVM is a compiler toolchain project that’s out there, and it has what are called compiler front ends, which basically take in source code as text and produce what’s called an intermediate representation, which is kind of code as data in some way. And you can often use front ends in code transformers: basically, your front end takes, let’s say, C code and turns it into the LLVM intermediate representation, and then your back end just turns it back into C. So you can just write your own step that works on the C code once it is in the intermediate representation.

Felienne 00:39:35 So is there a scenario where you would do that, where you would use this? Is this purely about the kind of language you are using, or are there other differences between such tools and Python?

Jordan Adler 00:39:48 In the specific case of, let’s say, LLVM IR, these are representations of code that don’t have the whitespace or comments or other parts that frankly aren’t meaningful to the machine, right, if you’re actually turning it from source code to machine code. So if the tools that you’re using to build your code transformer are really intended for compilers, you may not be in a good situation. But you can find versions of this for almost every language that’s out there. It’ll be very tech-stack specific, and you’ll have to do your own research, but those are some of the ones that I’ve used.

Felienne 00:40:38 So of course we want to also know about the pitfalls, right? What are some of the things that you ran into when doing this big migration? What are some of the mistakes that we should not make?

Jordan Adler 00:40:51 I mean, there are lots of pitfalls. I think probably the most immediate one that comes to mind is that no two use cases are the same. So you have to keep that in mind if you find instructions or guidance that apply generally. When I was working at Pinterest, we battle-tested the hell out of that Python-Future project. And I think you have to be conscious of that whenever you’re working with code transformer code that’s out there: whatever you’re picking up, chances are that if code exists, there are probably bugs in there too. There are bugs with any kind of software, I guess, but bugs that exist in code transformation software can be very difficult to detect if you’re not being intentional about it, and can be extremely difficult to deal with: basically, code is removed or code is changed, and it’s really hard.

Felienne 00:42:13 So talking about transforming multimillion-line code projects, what about performance? Did such a transformation take, like, an hour? A day?

Jordan Adler 00:42:25 Well, in the case of Pinterest, our migration took months, probably on the order of years, frankly. But you have to think about the project that you’re embarking on, what you’re trying to achieve, and what the desired outcome is before you reach for a tool. And if you find yourself in a situation where code transformation gets you more confidence, as it did at Pinterest, right, a multiyear project could be cut down to fewer of those years, but the running of those tools, those manual code transformers, is just one part of that project. And so you have to think about how the shape of your project is going to be different if you use this technique, if you’re trying to make a change and you’re pulling in automation as part of that change. So if you’re incorporating code transformation as part of your toolchain, for example, that will, as I mentioned earlier with code generators, increase your build time.

Jordan Adler 00:43:32 So that can become problematic as well. So yes, they can take time to run. There is a performance cost here, and depending on how you apply the technique or what you’re trying to achieve, the tradeoffs may not be there. Or they may end up being: yes, it takes longer to actually run the command and I’m spending more time waiting, but I’m spending less time typing the same things over and over and over again. And so that is the tradeoff that you have to think about. And sometimes that takes a view of the timeline, a temporal window, that is bigger than just the build step or just the actual part of running the code transformer itself.

Felienne 00:44:13 Yeah. So I guess what you’re saying is that running the transformation itself in such a big project is not really where, where the performance issues exist because in such a big project, it’s just maybe if it takes an hour, it doesn’t matter if this is a project of a few months.

Jordan Adler 00:44:28 Right. And also, we chunked it up. So we did pieces of 10 files at a time, for example, out of a thousand files. And so each run on each chunk takes some time, sure. But that process got us there much faster than if we had done it manually. Right.
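The chunking described above can be sketched as a small batching loop. The `chunks` and `migrate_in_batches` functions, the `transform` callback, and the file names are all hypothetical; the batch size of 10 and the total of a thousand files are the figures mentioned in the answer.

```python
def chunks(paths, size):
    """Yield successive slices of paths, each at most size items long."""
    for start in range(0, len(paths), size):
        yield paths[start:start + size]


def migrate_in_batches(paths, transform, size=10):
    """Run transform on every path, one batch at a time; return the batch count."""
    batch_count = 0
    for batch in chunks(paths, size):
        for path in batch:
            transform(path)
        batch_count += 1  # a natural point to run tests and land one small patch
    return batch_count


files = ["module_%d.py" % i for i in range(1000)]
seen = []
print(migrate_in_batches(files, seen.append))
```

Small batches keep each patch reviewable and make it easier to locate the change that introduced a regression.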

Felienne 00:44:53 So you already mentioned something about making sure that the code was the same because you could deploy it to a subset of users and see if not too many errors occur, but that is the code as the running artifact. But I was also curious about the code as an artifact for reading. Did you also make any improvements while transforming, maybe to some stylistic issues? Did you also try to improve the code base, improve the readability of the code, or at least not make the code readability worse? Because the interesting difference between transforming code and generating code is that with code generation, you don’t necessarily need to then maintain the generated code, but with these sorts of transformation projects, once you’re done, people will then manually continue to work with the code that you’ve transformed. So you have to make sure that this transformed code is reasonable for a person.

Jordan Adler 00:45:48 Yeah. I mean, I think I talked a bit earlier about abstract syntax trees and concrete syntax trees, and how one major difference is that concrete syntax trees include things like whitespace and comments, right? The parts of the source code that are not relevant to the machine that’s running the code, but rather to the programmer who’s reading it. And so if you have a code transformer that eliminates those things, that removes them, then the output code that you have is going to have those things stripped out, and that’s going to be less useful to the developer. So certainly that is something that you have to be conscious about when you’re running a code transformer: that you don’t eliminate comments or whitespace. There is also a set of tools out there called linters, or something like that.
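The difference described above can be seen with the Python standard library alone. In this sketch, `ast` stands in for an abstract syntax tree and the `tokenize` module stands in for a representation that keeps reader-facing detail; `tokenize` is a token stream rather than a full concrete syntax tree, so it is only an approximation of the idea. The sample source and variable names are assumptions for the example, and `ast.unparse` requires Python 3.9 or later.

```python
import ast
import io
import tokenize

SOURCE = "x = 1  # the answer\n"

# Round trip through the abstract syntax tree: the comment is gone.
round_tripped = ast.unparse(ast.parse(SOURCE))

# The token stream still carries the comment as its own token.
tokens = tokenize.generate_tokens(io.StringIO(SOURCE).readline)
comments = [tok.string for tok in tokens if tok.type == tokenize.COMMENT]

print(repr(round_tripped))
print(comments)
```

A transformer whose output is meant to be read and maintained by people needs a representation of the second kind, or some other way to carry comments and layout through the rewrite.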

Jordan Adler 00:46:39 So a linter does static analysis, which is basically: turn the source code into data, inspect it somehow, and return a result. This is a bad call, or this is a broken pattern, or this looks good, or whatever, right? So that’s a common case. A prettifier will take code and actually add whitespace as needed, or comments where appropriate, break up lines, change semicolons where they’re optional, all the stuff that are stylistic changes that historically people would spend lots of time arguing over in pull requests. Now we have basically a tool that you can run before you check in code that auto-formats your code. So there’s Prettier in JavaScript land, and there is a tool like this for Python. I think you’re going to see something like this in lots of different languages, where the open source community more or less standardizes on one style, because every little shop having its own setup specific to its own code base doesn’t actually improve readability, right?

Jordan Adler 00:47:54 In the sense that what really makes a difference to readability is that everyone expects code to look a certain way. People can quickly look and say, I see this pattern visually. And so the cognitive process of looking at a piece of text and recognizing calls in a certain way is a lot better when there are markers present or spacing is as expected. And so it’s really important, certainly for productivity, not to eliminate that. If you have a code transformer and it strips whitespace and comments, it’s broken, right? Because that is not really useful unless that’s a desired goal, right? In which case you probably shouldn’t be shipping that little thing anyways, because it’s probably a part of a bigger thing, like a compiler.

Felienne 00:48:39 So I guess what you’re saying is that you want to keep comments in place. You want to keep whitespace in place, and in some situations you might want to, if you are transforming anyway, also run the code through a tool like Prettier so that the output looks the same in similar cases, making it easier to read for developers.

Jordan Adler 00:49:01 In a transformation project, you’ll probably want to do that Prettier run before, right, in the sense that Prettier is an auto-formatter. It’s supposed to preserve semantics, right? It’s supposed to make no change to the semantics of the code; it just looks different. Do that first, get that big patch out with no semantic changes, and then you can make the semantic change more easily, then some sort
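One way to check that a formatting-only patch really leaves the semantics alone is to compare the parsed trees before and after. This is a minimal sketch for Python source using the standard library; the `same_structure` function and the sample snippets are assumptions for illustration, not part of any formatter mentioned in the conversation.

```python
import ast


def same_structure(before, after):
    """True if both sources parse to the same AST, ignoring layout and comments."""
    # ast.dump omits line and column attributes by default, so only the
    # structure of the code is compared.
    return ast.dump(ast.parse(before)) == ast.dump(ast.parse(after))


BEFORE = "def add(a,b):\n  return a+b\n"
FORMATTED = "def add(a, b):\n    return a + b  # sum\n"
CHANGED = "def add(a, b):\n    return a - b\n"

print(same_structure(BEFORE, FORMATTED))
print(same_structure(BEFORE, CHANGED))
```

Landing the formatting patch first, with a check like this, keeps the later semantic patch small enough to review line by line.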

Felienne 00:49:39 That’s really good advice. Just peeking at my notes. So this was actually everything I wanted to talk about. Is there anything we missed, any important tips or best practices or more stories that you have to share about code generation or transformation?

Jordan Adler 00:49:55 I think that I talked a bit about the different techniques for actually getting code from text into data. We talked about regexes. We talked about using text markers, or syntax trees. And for folks who are interested in learning more, that is a great place to start: start by playing with code. Take some script that you’ve written, see if you can turn it into some sort of data object in one way or another, and try to manipulate that. And you can use tools that are out there for your benefit. But if you’re really trying to learn and grow, I think it’s great to build something yourself, even if the tool is out there already. So I would definitely encourage people to check it out. It doesn’t take much to try and practice this technique, and you’ll find yourself with a new tool that you can use, really a superpower that you can leverage to benefit not just yourself, and that’s a win.

Felienne 00:50:57 I think that’s a great summary: knowing how to generate and transform code is like a superpower.

Jordan Adler 00:51:04 Oh, definitely.

Felienne 00:51:06 So any places where we can read more about you like your blog, your Twitter, any links we should add to the show notes?

Jordan Adler 00:51:13 Absolutely. I have a website also

Felienne 00:51:36 Notes. The

Jordan Adler 00:51:41 Thank you so much.

[End of Audio]

SE Radio theme: “Broken Reality” by Kevin MacLeod (incompetech.com — Licensed under Creative Commons: By Attribution 3.0)

