Awesome. Thank you, everybody, for sharing where you're joining from, and we see that we have people from everywhere around the world joining us here. A lot of folks from India, Portland. We see people from Seattle over here, in Sunnyvale, Santa Clara, S.F. as well, Vancouver. Hi, everybody, glad that you could join us today.
So, continuing with our agenda for DevFest West Coast, I have the honor of inviting one of my friends, Hannes Hapke.
He's a senior machine learning engineer at SAP. He's also a Google Developer Expert on machine learning. He also recently wrote an amazing book on building machine learning pipelines, which is now available for sale from O'Reilly. So feel free to grab the book, go order it on Amazon. And yeah, I'm assuming the talk will be a summary of what that book is about. Over to you.
Thank you. Thank you very much for that very kind introduction, and also thank you for having me today to talk about TensorFlow Extended. And it's great to be part of that amazing DevFest West Coast 2020. Yeah, let's jump right into the presentation. So what I want to talk about in the next 30 minutes is how we can take the machine learning models we have trained in our experiments and how we turn them into pipelines so we can use them consistently in products. Just real quick, as mentioned, I work at SAP. I'm one of those GDEs, as you can see in that photo here, and together with Catherine Nelson from SAP, I co-authored the O'Reilly publication Building Machine Learning Pipelines. And today's talk is a very, very, very brief summary, and some slides might be abbreviated and we might rush through some content here. So if you want the complete deep dive, I highly recommend the publication.
Before we get into all the TensorFlow Extended details, let's quickly talk about machine learning engineering. When we normally talk about machine learning in presentations, we talk about algorithms, we talk about new architectures.
This is basically the orange box here. So we talk about machine learning frameworks, PyTorch versus TensorFlow, et cetera, et cetera. But when we want to take those trained models into production, the ecosystem looks much bigger. So all of a sudden we have to take a look at how do we keep updating our models, how do we analyze the models, how do we serve them, things like this. And those topics are what we understand these days under the term machine learning engineering. So when I describe my job, I often say it's sort of like being the machine learning plumber. You take the data, you turn it into a model, I help with the training, and then I help with deploying those models and capturing the data from those deployments.
But I'm not focusing on the actual ML code; that's what, in my opinion, the data scientists are there to do, they are the experts. But then when we take a look, it's like, why do we need machine learning engineering?
There's often the conception in the machine learning field that, "hey, I trained my models in my notebook." Well, there's a couple of things happening here. Yes, you get a trained model out, but there is no focus on reproducibility. There's no focus on traceability, which will become more and more important in the coming years. And we want to reduce the burden for data scientists. That's what I think the purpose of machine learning engineers is. We want to make it as easy as possible to update models, because if your data scientists release models, we can't spend two hours each week for each model to update it and keep it up to date. That needs to happen automatically. And I want to show in the next 30, 25 minutes how you can do this with TensorFlow Extended. All right, so let's quickly talk about what happens to trained models in most situations. Well, sadly, they usually don't get deployed. Let's face it, because the whole ecosystem of machine learning engineering has been too cumbersome and complicated in the past, a lot of data scientists have given up on bringing their amazing work into a real-life scenario.
There is always a funnel. So let's say you have a thousand experiments. Maybe you tune one hundred of those models, and maybe ten of them get deployed. But we might miss wonderful opportunities to contribute back to society with machine learning and help with applications if the deployment, or the operation of those machine learning lifecycles, is too cumbersome. There are some other aspects. Some models do get deployed, but then all of a sudden we have a whole different set of problems. Those models experience data drift, their schema might change. So we trained with one set of data and then three months later, maybe the database schema changes and all of a sudden our training data downstream changes as well. We see training-serving skews. There are complicated preprocessing steps, and often they get updated without updating the model. The whole retraining is really complicated. When I speak with other data scientists, I hear stories like, maybe we don't update models because it's just simply too expensive and takes too much time. And then we have the whole ecosystem of deploying machine learning models when it comes to, how can I reduce the latency of my predictions, and things like this. So this is what we describe as the machine learning lifecycle, if you would draw it out as a very generic lifecycle.
So it always starts with the ingestion of the data. We validate the data, we preprocess or do some feature engineering. We train a model, and in some situations we also tune our models. Then there's a very critical step which I think often gets left out when we talk about the machine learning lifecycle: before we deploy the model and after we train it, the model needs to be seriously analyzed and also validated. And then once the model is validated and deployed, we really need to capture the feedback about the model, because that's our lifeline and we can also use the feedback to generate more data. So that's why it's an endless cycle, as you can see. And as you can imagine, those systems are highly entangled, meaning if you change some parameters during your training, your model analysis and model validation will look very different. You might have to validate with different thresholds, and your serving might look different. So if you change one parameter, it has a whole downstream effect. So doing this manually will take a tremendous amount of time, and that's why the focus should be on automating those machine learning lifecycles. And this is where TFX, or what we call TensorFlow Extended, is here to help. Basically, TensorFlow Extended provides you all the tools to automate your machine learning lifecycle.
So if we take a look at this, these are basically the individual components. So we have components in our machine learning pipeline here. We start with the data ingestion. We do some data validation and preprocessing. Sometimes people do the data validation after the preprocessing as well; there are good reasons for that, but for simplicity I've left it out here in the overview. We train our models, we validate them, and then we deploy them. And this all runs in the TFX ecosystem. And then we can express those components and orchestrate those pipelines in various systems, and I'll talk about this in a couple of seconds. But also, if you look at those components TFX is providing, basically for each step you see in those blue boxes, you have the white boxes, which are the individual components you can use from the ecosystem to build your machine learning pipelines. And then you have the orange boxes, which are the standalone libraries from the TensorFlow ecosystem, which you can use for your pipelines. And some of those components are framework agnostic, so if you have a model from a different framework, you can also use it with those components.
Yeah.
As I said earlier, once you design your pipeline, you can orchestrate that pipeline and you can run it in various ways. You basically have four different ways of running your pipelines. The first one, you see the Jupyter notebooks there; that's technically not a pipeline, but in the TFX ecosystem it's called the interactive context. Basically, you can run the individual components and you can debug those components. Some people use that in a notebook just to string a pipeline together, and if it works, they export it to Apache Airflow or Kubeflow Pipelines. There will be a talk right after mine where you get a wonderful deep dive into Kubeflow, so I highly recommend staying on for that topic; you will find out all the details there. What I want to stress here is, if you don't have Kubeflow set up, maybe you work in a company or on a project where you already have Apache Airflow, you can also orchestrate your machine learning pipeline with Airflow. All right, let's take a quick look at how we can see TFX in action. A couple of months ago, my colleagues at Concur Labs and I wrote a blog post on the TensorFlow blog about how we use BERT models in connection with TensorFlow Extended pipelines. And the reason for implementing that as a TFX pipeline was, A, we wanted to be able to update our model frequently, and B, we wanted to take advantage of TensorFlow Transform. In the interest of time, I will not get into all the TensorFlow Transform details, but there are two blog posts with all the details. If you're interested in how we use it and why we use TensorFlow Transform for our BERT models, I highly recommend those blog posts, as you can see here on that slide.
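As a rough sketch of what the interactive context mentioned above can look like in a notebook (the data path is a placeholder, and the exact constructor arguments differ slightly between TFX versions):

```python
# A minimal sketch of running single TFX components in a notebook via the
# interactive context. Path and component choice are illustrative only.
from tfx.components import CsvExampleGen, StatisticsGen
from tfx.orchestration.experimental.interactive.interactive_context import \
    InteractiveContext

context = InteractiveContext()  # creates a temporary pipeline root and metadata store

# Ingest CSV data from a local folder (hypothetical path).
example_gen = CsvExampleGen(input_base="data/")
context.run(example_gen)

# Generate dataset statistics from the ingested examples.
statistics_gen = StatisticsGen(examples=example_gen.outputs["examples"])
context.run(statistics_gen)

# Visualize the statistics directly in the notebook.
context.show(statistics_gen.outputs["statistics"])
```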
But let me walk you through how we set up the pipelines. So, again, any pipeline starts with the ingestion of the data. Here we define our input data in TFX, and it has wonderful functionalities. One of the cool functionalities is you can ingest the data from a folder or from some cloud bucket; it could be GCS, S3, anything Apache Beam supports. And then you can split the data according to a ratio you set. It doesn't need to be a two-way split, so you can also have, say, an extra test set. It's totally up to you, it's highly configurable. But this split will then be considered in the pipeline for your training and for your validation and potentially for testing. And then once you have all the inputs and outputs configured, you can ingest the data, and for that purpose TensorFlow Extended provides you with numerous components. So in this case, we're loading TFRecord files we already preprocessed. This is a very generic import here; the reason is also that we use a public data set that was already available as TFRecords. There are components out there for BigQuery connections, and we internally wrote components for our own data connections, so if you have a custom data source, you can easily connect there. There are numerous ways of ingesting the data. The wonderful thing I really want to stress here is what happens under the hood, and you don't have to touch it, you actually don't even see it: under the hood it is always executing everything on Apache Beam. So if you have millions of files, Apache Beam will take care of distributing the data ingestion.
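Here is a sketch of what this ingestion step can look like in code. The bucket path and the 6:2:2 split ratio are placeholders, and the exact constructor arguments vary between TFX versions:

```python
# Load pre-processed TFRecord files and define a custom train/eval/test split.
from tfx.components import ImportExampleGen
from tfx.proto import example_gen_pb2

output_config = example_gen_pb2.Output(
    split_config=example_gen_pb2.SplitConfig(splits=[
        example_gen_pb2.SplitConfig.Split(name="train", hash_buckets=6),
        example_gen_pb2.SplitConfig.Split(name="eval", hash_buckets=2),
        example_gen_pb2.SplitConfig.Split(name="test", hash_buckets=2),
    ])
)

# The folder could live on local disk, GCS, or S3 -- anything Beam can read.
example_gen = ImportExampleGen(
    input_base="gs://my-bucket/tfrecords/",  # hypothetical location
    output_config=output_config,
)
```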
Once the data is ingested, we can do two amazing things.
Number one is we can generate statistics of the dataset. And here I want to stress, this one-liner is basically all you need. You point the statistics generator to the example data set you just ingested and it will generate the statistics, and I'll show you exactly how in a second. Once the statistics have been generated, you can also generate a schema, and I think the schema is often overlooked when we talk about machine learning pipelines. This will give you information like: what are the columns, what are the features, are the features dense or sparse, and if you have categorical features, what are the labels. And the wonderful thing in TensorFlow Extended is that this information will be stored in the metadata store, and you can later compare against previous runs. So, for example, if we train a machine learning model today and we ingest new data, and let's say the data has been updated or the schema has changed, the pipeline would alert us and say, hey, wait a second, there's a new column, or maybe there's a new label coming up in a categorical field. We can set thresholds where we can say, if the data drifts in a certain direction, stop the entire pipeline and do not produce anything, but alert the data scientists. Here you see an example of a visualization of the data which is also generated, and you will see this later on in the Kubeflow Pipelines example: you can drill into the individual components, and some of those components have visualizations.
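A minimal sketch of the data-validation steps just described, building on the example_gen component from the previous snippet:

```python
from tfx.components import ExampleValidator, SchemaGen, StatisticsGen

# One-liner: generate statistics over every split of the ingested data.
statistics_gen = StatisticsGen(examples=example_gen.outputs["examples"])

# Infer a schema (feature names, types, value domains) from the statistics.
schema_gen = SchemaGen(
    statistics=statistics_gen.outputs["statistics"],
    infer_feature_shape=True,
)

# Check the statistics against the schema and flag anomalies such as
# missing columns, unexpected categorical values, or data drift.
example_validator = ExampleValidator(
    statistics=statistics_gen.outputs["statistics"],
    schema=schema_gen.outputs["schema"],
)
```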
So while the pipeline is running and continuously retraining the model from time to time, you get those amazing visualizations where you can drill into each feature, you can see the distributions, you can even compare the distributions across data sets. There's a lot you can do with TensorFlow Data Validation. So once we have the data not just ingested but also the schema generated and the statistics generated, then we can start diving into the feature engineering, and we can do this with TensorFlow Transform. This would basically merit its own talk, so I'll just leave it at this: you can express your data preprocessing steps as TensorFlow ops, and because you do that as TensorFlow ops, you can build a graph of them. And later on, when I show you how we export the model, you can attach the preprocessing graph to the graph of your trained model and export it as one artifact. This is a wonderful way to avoid a training-serving skew when we serve machine learning models, and to always make sure that the preprocessing step matches your latest machine learning model. So you just define those TensorFlow ops, and the TensorFlow ecosystem provides numerous wonderful manipulations. So if you're in the computer vision field, you can look into image manipulations from tf.image, or if you work with text problems, there's tf.text, where you get very powerful tokenizers for modern transformer models, or you might just need simple string transformations.
You can always express them through TensorFlow ops, and if not, then under certain circumstances you can also use tf.functions to express your operations here. Once you have defined this, you save it in a module file, and in our case we call it transform.py, and then we pass that to the Transform component, where the component itself will look for the preprocessing_fn function. That's basically the hook, and then it will apply those preprocessing steps to the ingested data.
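A sketch of what such a module and the Transform component can look like; the feature names here are hypothetical:

```python
# transform.py -- preprocessing steps expressed as TensorFlow Transform ops.
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    outputs = {}
    # Scale a numeric feature to zero mean and unit variance.
    outputs["amount_xf"] = tft.scale_to_z_score(inputs["amount"])
    # Turn a string feature into an integer vocabulary index.
    outputs["product_xf"] = tft.compute_and_apply_vocabulary(inputs["product"])
    outputs["label"] = inputs["label"]
    return outputs
```

```python
# Pipeline side: the Transform component looks for preprocessing_fn in the
# module file and applies it to the ingested examples.
from tfx.components import Transform

transform = Transform(
    examples=example_gen.outputs["examples"],
    schema=schema_gen.outputs["schema"],
    module_file="transform.py",
)
```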
Once the data is transformed, we can start training our model. There is a way which was recently introduced in TFX where we can use a generic trainer, which allows us to use Keras models. If you look at previous documentation, we always had to use the Estimator architecture; this is not the case anymore. I want to stress this: you do not have to train an Estimator at this point anymore. And the wonderful thing is, because you have the preprocessing steps performed, we know the schema after the preprocessing, so you can easily generate the inputs of your machine learning model. That's basically how we wire the data up to the model we're training here. We can apply distribution strategies if we need to; currently the Trainer supports, for example, the MirroredStrategy, so while we can't spin up external clusters, we can make use of multiple GPUs we have on hand. That's so far a really good way of distributing the training. We define our models as usual, through the functional API or through the sequential API, as long as it's a Keras model. What I like to do is pass in the transform output information, because then we can dynamically create the inputs to the model based on the features the Transform step produces. The training and the saving is exactly as we know it from previous TensorFlow or Keras models: we call the fit function, and then once we're done with the training, we save the model for future evaluations. And again, all those steps we export to a module file, in this case we call it the training module file, and then we pass this file to the Trainer component, and all the training information will be in there.
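A rough sketch of the generic-trainer hook and the Trainer component. The feature names are hypothetical, the tf.data input pipeline and the model.fit call are omitted for brevity, and details such as the fn_args attribute names or the need for an explicit generic-executor spec vary between TFX versions:

```python
# module.py -- the run_fn hook that the generic Trainer executes.
import tensorflow as tf
import tensorflow_transform as tft

def run_fn(fn_args):
    # Load the preprocessing graph produced by the Transform component.
    tf_transform_output = tft.TFTransformOutput(fn_args.transform_output)

    # Build model inputs from the transformed feature names (hypothetical).
    feature_names = ["amount_xf", "product_xf"]
    inputs = {name: tf.keras.Input(shape=(1,), name=name) for name in feature_names}
    x = tf.keras.layers.Concatenate()(list(inputs.values()))
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
    model = tf.keras.Model(inputs=inputs, outputs=outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

    # Building tf.data datasets from fn_args.train_files / fn_args.eval_files
    # and calling model.fit(...) is omitted here for brevity.

    model.save(fn_args.serving_model_dir, save_format="tf")
```

```python
# Pipeline side: the Trainer component receives the module file plus the
# transformed examples and transform graph.
from tfx.components import Trainer
from tfx.proto import trainer_pb2

trainer = Trainer(
    module_file="module.py",
    examples=transform.outputs["transformed_examples"],
    transform_graph=transform.outputs["transform_graph"],
    schema=schema_gen.outputs["schema"],
    train_args=trainer_pb2.TrainArgs(num_steps=1000),
    eval_args=trainer_pb2.EvalArgs(num_steps=100),
)
```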
All right, we can also do the model analysis. The analysis steps are fairly lengthy and really abbreviated here, so if you're interested in how to configure this in detail, I highly recommend taking a look at the documentation from TFX or at our publication. This is a really, really powerful component, because what you can do here is say: after the model was trained, evaluate it with certain metrics, and the metrics are not just accuracy. This could be an F1 score, it could be precision or recall, it could be whatever you want to define. And then you can take a look at specific slices via slicing specs. So normally when you train a machine learning model and you run your validation set against it after an epoch, you run it against the entire data set, and then you see maybe an accuracy number, or some loss, or maybe even an F1 score for the entire dataset. But let's say you train a machine learning model and your data is 80 percent from one country and 20 percent from another country, and you're trying to predict, let's say, restaurant reviews. You want to make sure that the model's performance per country is similar for every country, but you can never see this from the model validation during the training. So this is where the model analysis becomes extremely powerful, because you can slice by country and then make sure that the accuracy, or the F1 score, or whatever metric you use, is similar for every country. And if there's an outlier, you can investigate further or stop the deployment. This is extremely powerful. So really quick: as I said, you define your slicing specs, you define your metrics, and then you can either slice against the entire dataset, which would be very similar to your normal validation during the training, or you pick specific columns, slice by those columns, and run the metrics against those slices.
Super powerful.
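Here is a sketch of the slicing configuration just described, using a hypothetical "country" feature and label key; metric choices are illustrative:

```python
# Evaluate the model overall and sliced per country.
import tensorflow_model_analysis as tfma
from tfx.components import Evaluator

eval_config = tfma.EvalConfig(
    model_specs=[tfma.ModelSpec(label_key="label")],
    metrics_specs=[tfma.MetricsSpec(metrics=[
        tfma.MetricConfig(class_name="BinaryAccuracy"),
        tfma.MetricConfig(class_name="AUC"),
    ])],
    slicing_specs=[
        tfma.SlicingSpec(),                          # entire dataset
        tfma.SlicingSpec(feature_keys=["country"]),  # one slice per country
    ],
)

evaluator = Evaluator(
    examples=example_gen.outputs["examples"],
    model=trainer.outputs["model"],
    eval_config=eval_config,
)
```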
And again, this might be very small to read right now, but it generates statistics, and those statistics can be visualized within your components when you execute them, for example with Kubeflow Pipelines, so we can drill into those components and then see the statistics afterwards. And the wonderful thing is, those statistics are also preserved. So even if the model has been shipped, we can always go back to our components and investigate, maybe in two months, what the training actually looked like. So if a customer comes in and says, hey, this model wasn't performing as expected, we can see, did we miss something? We have all the metadata available here, and we can also compare the different metrics, et cetera. And then the last step in our short pipeline here is the model deployment, and the model deployment works hand in hand with TensorFlow Serving. You actually don't have to touch the model coming out of the pipeline; you basically push the model into a location where a model server like TensorFlow Serving, or another deployment setup, could pick it up. So you can push those models to a GCS bucket, and TensorFlow Serving could pick them up. But this push component can also create model-serving endpoints in GCP, or, as we have done internally, we have created components to create SageMaker endpoints if you're in the AWS ecosystem. So this gives you a lot of flexibility in how you want to deploy machine learning models. But the wonderful thing is, everything up to this point has to work out: it has to pass the data validation, it has to pass the training, it has to pass the model analysis. And the model analysis, as I should have mentioned earlier, also compares against previous runs, so we can set the criteria that a model has to be better by X percent than the previous models we have pushed out.
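A sketch of the deployment step: the model is pushed only if the Evaluator "blessed" it. The serving directory is a placeholder; it could equally be a cloud bucket that a model server watches:

```python
from tfx.components import Pusher
from tfx.proto import pusher_pb2

pusher = Pusher(
    model=trainer.outputs["model"],
    model_blessing=evaluator.outputs["blessing"],  # only push validated models
    push_destination=pusher_pb2.PushDestination(
        filesystem=pusher_pb2.PushDestination.Filesystem(
            base_directory="/serving_models/my_model")),  # hypothetical path
)
```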
And then, once we define our pipelines, nothing has been executed so far, and we can orchestrate them with the frameworks or with the libraries I put up here on this slide.
So if you want a quick and easy way of orchestrating your pipelines, you can use Apache Beam. It doesn't have a front end, but it makes sure that those components are orchestrated. If you already have systems in your company or in your project and you use Airflow, you can easily orchestrate your pipelines with Apache Airflow. And if you have Kubernetes running, it's very straightforward to set up Kubeflow, and you can orchestrate your pipelines with Kubeflow Pipelines, which is a system specifically designed for managing pipelines, and you get all its wonderful benefits.
There's one last step you need to do: you need to basically define which components go into your pipeline, and then you define the configuration for your pipeline.
So you say where your metadata store is, where you store the information from each component so that the next component can access it. You can set configurations like how many GPUs you want to use, how many gigabytes of memory you want to claim in your cluster, for example on Google Cloud, et cetera, and then you run your pipeline. What I want to highlight here is, if you run your pipeline on Airflow, or, as it is more beautifully executed, if you run it with the Kubeflow runner, the runner will basically convert whatever you configured in Python into an Argo configuration file, which you can then use either directly with Argo or with Kubeflow Pipelines, as will be demonstrated in the next talk. One thing I want to stress here is that metadata is everything. I see a lot of talks about machine learning pipelines where there's not a single word mentioned about metadata, but no component could connect to another if we didn't have the metadata. So TFX provides you the metadata store. It will take care of the communication between the components, and you don't have to set up which are the inputs and which are the outputs; that's what the components provide for you. Even in the Kubeflow DSL, you sometimes have to wire those components together.
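To tie it together, here is a sketch of assembling the components into a pipeline and running it with the Apache Beam orchestrator; the pipeline name and paths are placeholders, and the Airflow and Kubeflow runners follow the same pattern:

```python
from tfx.orchestration import metadata, pipeline
from tfx.orchestration.beam.beam_dag_runner import BeamDagRunner

tfx_pipeline = pipeline.Pipeline(
    pipeline_name="my_pipeline",
    pipeline_root="/tfx/pipelines/my_pipeline",   # where artifacts are stored
    components=[example_gen, statistics_gen, schema_gen, example_validator,
                transform, trainer, evaluator, pusher],
    metadata_connection_config=metadata.sqlite_metadata_connection_config(
        "/tfx/metadata/metadata.db"),              # the metadata store
)

BeamDagRunner().run(tfx_pipeline)
```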
So if you want to simplify your life, TFX provides you all those premade components. If you want to configure your own custom components, TensorFlow Extended is highly flexible, so you can write components from scratch, or you can inherit from existing components and manipulate some executors. So there's a lot of flexibility for you in that system. Here is a quick example of how we run these pipelines. This is an example from our book: we have a public data set of customer complaints, and we want to figure out whether a customer will complain or not. It runs in Kubeflow Pipelines: we converted our Python setup into an Argo configuration and we ran it with Kubeflow Pipelines. And then we can set it up, there is a menu where you can set up recurring runs or daily runs, how often you want to trigger it. You can also trigger it through S3 buckets; there are a lot of ways you can trigger updates. And one thing I want to highlight here is the lineage, the model lineage which comes out of the metadata store. So once we have trained a machine learning model, and let's say we shipped this model, we can always go back to the metadata store and say, OK, we shipped this model at this date, let's retrace it.
What was the input? Which model was used, what were the hyperparameters, who signed off? We use components where humans sign off on model deployments, and we can check which data scientist developed the evaluation, what were the criteria, et cetera, et cetera. So this is a wonderful thing: this is all preserved, so we can always go back. That's our record of what led to that model. All right, I'm getting short on time, so let me quickly summarize the reasons for TFX pipelines. As I said, the metadata store is there; it helps us make our machine learning pipelines reproducible and also very consistent. We can automate the model updates. We get an audit trail, as I just showed you on the previous slide. There are ways we can bring the human into the loop: you can have a little review component in your setup where somebody, for example in a Slack channel, gets contacted if a model reaches a certain state in the pipeline and the pipeline evaluated the model and said, hey, this looks great. They can look over the results and sign off, and then we approve or reject it. We can also automatically convert models to TensorFlow Lite within the TFX pipelines.
With TensorFlow Lite becoming more and more powerful, this is extremely wonderful: we can train two models in parallel, one model goes to a machine learning server and the other model is shipped to the mobile team. We can tune hyperparameters as part of the pipeline, and then the hyperparameters of the best run will be used for the final training setup. We can do pipeline branching, so sometimes you want to have a canary model, or sometimes you want to have a test model with maybe slightly different hyperparameters; we can train them in parallel. And we finally have a way to do continuous model deployments. If you're interested in all the details, I rushed through a lot of them here; I highly recommend the publication. It came out this week, so it's brand new and everything is up to date with the latest TensorFlow and TFX versions. And again, if you're interested in all the details of the example, check out the blog posts I mentioned, where we explain in two parts the details of how everything works with BERT. And with that, thank you very much for taking the time. And yeah, there's a couple of minutes left, so I'm happy to take some questions.
Awesome, thank you, Hannes. I don't see a lot of questions, but people mentioned something about data, if you want to talk a little bit more about that.
Like, maybe an instance you have seen that demonstrates, for instance, data drift in a model, that you can talk about.
Yeah. So it's often the case that when you train a machine learning model and you put it out into the wild, people react to that and they learn how to tweak that model to their benefit, and that usually becomes visible in the data. So all of a sudden certain labels don't show up anymore, or show up dominantly, and then your entire model gets biased in a certain direction. And that's the wonderful thing with the TensorFlow Data Validation tool: we can generate those statistics and compare them with every data ingestion. So I highly recommend that tool.
Awesome. Fantastic.
Thank you so much. Thank you so much for the talk and looking forward to the book, actually.
Thanks for having me. Yeah, it's been quite a long journey. I'm looking forward to having the print copy for myself as well.
So you're excited about it. Awesome.
Thank you. Thank you. We will have next speaker joining in soon. Thank you. Thank you.