Interview with Jurg from Green IT Amsterdam
18.10.2021
Our project manager Mohan Gandhi talked with Jurg van Vliet from GITA about what makes ECO-Qube unique, the current progress, the challenges that arise, and many other topics, including Kubernetes.
You can watch the whole interview here, or you can read the transcript below.
Transcript of the interview
Mo: OK. Good morning, I'm joined by Jurg van Vliet. Jurg has been leading the development aspects of ECO-Qube, specifically the data center assessment platform, which is a core element of the project. Jurg, I'd love to get your thoughts for the next 20 minutes. So I'd like to start with the first question: what makes ECO-Qube unique, in your opinion?
Jurg: I think ECO-Qube is unique in the attempt to bring all the layers together from a data center perspective. The end goal of ECO-Qube is to reduce the use of energy or at least optimize the effectiveness of the use of energy by adjusting the climate and changing workloads. And that means that you need to travel up the stack and include the software. That has rarely been done outside of the big hyperscalers.
Mo: And that has potentially far-reaching consequences for the industry, like you said, because hyperscalers can do it: they own the whole vertical. But what this project does is potentially open the door for other companies, smaller companies who don't own the vertical but can still connect each layer of it. So in effect we unbundle those siloed vertical elements. Is that right?
Jurg: Um, yeah, in a way. One of the things that people always talk about is increasing utilization. If you want to use your equipment, you need to use it to its full capacity, and to increase utilization you need to go up the stack and integrate these different layers.
Mo: And approximately what sort of improvements, what efficiencies, what increases in utilization can we get from this project?
Jurg: I'm not really sure what the current average utilization of a small data center is, because we are talking about small data centers within the scope of this project. I do think that if you have a data center, or a collection of data centers, where you have redundancy and a little bit of scale, an average utilization of 60 percent should be attainable. It's not higher because you need to deal with calamities such as loss of nodes, loss of service, or loss of network, and then you need to use spare capacity elsewhere. So I can't answer your question about the relative increase in utilization, but striving for a 60 to 70 percent utilization rate is most logical, I think.
Mo: Yeah, I guess we'll know at the end of the project what level of utilization we can get to.
Jurg: Yeah, yeah, absolutely.
Mo: So maybe now if we dive a little bit into the project, what's the progress? What's the state of the project so far?
Jurg: If you look at the start of the project, it was very clear and very well articulated what the different end results should be. There was one part that was missing, and that is where you collect the data and how you expose it to the other consortium members or to the other work packages in the project. At this moment we have a platform ready to accept the data from the different pilot data centers that we have. We are understanding more and more how we need to structure that data. We call it labelling internally: if we want to aggregate utilization data across a data center, for example, we need to label the data coming in from each server. So that's where we are right now. Especially with the help of Bitnet, who are now provisioning both their own data center and the EMPA data center, that work has been accelerated tremendously, and I think we can really start to collect at least a core of that data very, very soon. The challenge that we then face is that each data center is different, and we will need to figure out which data we need to measure our KPIs. Some of those data are missing, and for that we need to acquire additional sensors or integrate with other components we find in the data centers. So that's basically the next big challenge: to get the whole collection of data.
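To make the labelling idea concrete, here is a minimal sketch of how incoming server metrics could be tagged and then aggregated per data center. The field names, labels, and aggregation are illustrative assumptions, not the actual schema of the ECO-Qube assessment platform.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

@dataclass
class ServerSample:
    server_id: str          # e.g. "rack1-node01" (illustrative naming scheme)
    datacenter: str         # label attached when the sample is ingested
    rack: str               # label used for climate-related aggregation
    cpu_utilization: float  # 0.0 .. 1.0

def aggregate_utilization(samples: list[ServerSample]) -> dict[str, float]:
    """Average CPU utilization per data center, using the labels on each sample."""
    by_dc: dict[str, list[float]] = defaultdict(list)
    for s in samples:
        by_dc[s.datacenter].append(s.cpu_utilization)
    return {dc: mean(values) for dc, values in by_dc.items()}

samples = [
    ServerSample("rack1-node01", "pilot-a", "rack1", 0.42),
    ServerSample("rack1-node02", "pilot-a", "rack1", 0.55),
    ServerSample("rack9-node03", "pilot-b", "rack9", 0.71),
]
print(aggregate_utilization(samples))  # {'pilot-a': 0.485, 'pilot-b': 0.71}
```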
Mo: All right. So we're approaching a pretty major milestone, which is actually getting these data centers online with the data collection platform. And as you mentioned, every data center is different and that's the same thing in the industry. So the complexity of this project involves understanding the different elements, the geometry, and the sensor suite. Everything is unique and that adds a layer of complexity, which is arguably why it's never been done before. But we're pretty close to cracking the nut.
Jurg: Yeah, exactly. And then you need to feed your data back. So once you have the data, you analyze it and then you need to act. And the acting again means integrating with existing systems within a data center at the climate level, but also at the workload level. I think that has not been done before commercially, at least not in a commercial product outside of what hyperscalers do internally.
Mo: Awesome. I'd like to dive now a little bit into the technical elements that you're working on, and I know you're very good at making them accessible. So maybe you could just talk for 30 seconds about Kubernetes and the role of Kubernetes in this project.
Jurg: So if one of your end goals is to move workloads around, for example to move a running process from one server to another, you need to have the servers combined in, say, a server pool. That is called virtualization. There are different solutions for this, but Kubernetes is the most accessible in terms of API. So if we can group all the servers in one data center together in one Kubernetes cluster, then we can change the scheduler, as they call it, to adjust particular types of workloads or to move them from one part of the data center to another. So if one rack is cold and the other is warm, the scheduler will say: hey, can I move here? Yes, OK, then I will move you to that particular part of the data center. An alternative to Kubernetes would be OpenStack or VMware, but I think for the project that we have it's more interesting to choose one particular virtualization layer, in this case Kubernetes, because then we can focus much more on the actual shifting of the workloads instead of building a scheduler for three different virtualization platforms.
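As a rough illustration of the kind of decision such a scheduler extension would make, here is a small sketch that picks the coolest rack with spare capacity for a movable workload. The rack names, temperatures, and capacity figures are invented for the example; the actual ECO-Qube scheduling logic will live inside Kubernetes and be considerably more involved.

```python
from dataclasses import dataclass

@dataclass
class Rack:
    name: str
    inlet_temp_c: float   # current inlet temperature reported by the rack sensors
    free_cpu_cores: int   # spare capacity left on the servers in this rack

def pick_rack(racks: list[Rack], required_cores: int) -> Rack | None:
    """Choose the coolest rack that still has room for the workload."""
    candidates = [r for r in racks if r.free_cpu_cores >= required_cores]
    if not candidates:
        return None  # no capacity anywhere: the workload stays where it is
    return min(candidates, key=lambda r: r.inlet_temp_c)

racks = [
    Rack("rack-a", inlet_temp_c=27.5, free_cpu_cores=4),
    Rack("rack-b", inlet_temp_c=22.0, free_cpu_cores=16),
    Rack("rack-c", inlet_temp_c=21.0, free_cpu_cores=2),
]
target = pick_rack(racks, required_cores=8)
print(target.name if target else "no move")  # rack-b: coolest rack with enough room
```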
Mo: OK, gotcha. So the goal then is basically the smart data center, which is moving workloads to the most efficient part of the data center to be computed there.
Jurg: Yeah, it sounds like a very easy thing to do. Moving workloads, it's just two words. But look at a database, for example. In a database you don't want to lose data, right? You expect it to always be there, and to at least have consistency in the data that you put in there. So with a database you need to be much more careful about moving it around than with a process that can die, be spun up again, and just continue doing what it did before. I think you will need to introduce different classes and designate each process: this one can be moved, this one can be moved but only at night, and this one you need to stay away from. So it is not just scheduling workloads to be moved around, but something you need to be very careful about.
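One way to picture such classes is a simple policy check like the sketch below. The three class names and the night-time window are made up for illustration; how ECO-Qube will actually designate workloads is, as Jurg says, still an open question.

```python
from datetime import time
from enum import Enum

class Movability(Enum):
    ANYTIME = "anytime"    # stateless workers: can be rescheduled whenever needed
    OFF_PEAK = "off_peak"  # tolerates moves, but only in a quiet window
    PINNED = "pinned"      # e.g. a database: never move automatically

def may_move(workload_class: Movability, now: time) -> bool:
    """Decide whether the scheduler is allowed to move this workload right now."""
    if workload_class is Movability.ANYTIME:
        return True
    if workload_class is Movability.OFF_PEAK:
        return now >= time(23, 0) or now <= time(5, 0)  # illustrative night window
    return False  # PINNED workloads stay where they are

print(may_move(Movability.OFF_PEAK, time(2, 30)))  # True
print(may_move(Movability.PINNED, time(2, 30)))    # False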
Mo: That was going to be my next question. There are a number of challenges: there are technical challenges, there are business challenges, there are ownership challenges. So I wonder if you could dive into what are the biggest challenges that we're facing in this project?
Jurg: I think the biggest challenge that we face in the project is to implement the virtualization layer across the whole data center. I haven't seen that yet, and that is something we need to have before we can start moving things around. I think that's the single most challenging aspect at this moment. After that you can get the data out, and then figuring out where to move something is not the biggest problem. I think the biggest problem will be to feed that data back in a way that allows a timely reaction. Kubernetes itself I know better than OpenStack or VMware, so moving workloads around in Kubernetes shouldn't be too much of a problem. But if you start integrating with your climate system I expect there to be interesting challenges, because those systems are not very simple either.
Mo: Yeah, as you said, there's complexity and there are challenges in integrating the different systems.
Jurg: Yeah, absolutely. Especially feeding it back in a real time manner.
Mo: Absolutely amazing. So, last question, and sort of an open question: is there anything else that you'd like to share with us about this project? Maybe where you see this R&D work, the effect that it could have, where you see the industry going in five years' time as a result of the work on this project. Is there anything you'd like to add?
Jurg: Yeah, I think so. I used to be closer to this industry 20 years ago, and I've been doing other stuff between then and now. If I look at the industry from a sustainability perspective, there's a lot of focus on the things that are relatively easy to attain. It is not entirely fair to say it like that, but you can optimize your energy efficiency relatively easily, with insulation for example, and you can do a lot of those things. What is hard to do is to measure effective use. And that is, I think, one of the more interesting aspects of this project: is the energy spent on a particular process used effectively to get the work done all the time, or fast enough, or something like that? So if we shift workloads and achieve a more balanced workload distribution, you could see that your energy use goes down, which would express itself in a PUE that goes up, while the work might even get done faster. You see what I mean? Then the traditional way of measuring data center energy utilization is not sufficient, so you need a different view on what is being done in a data center. You see in research that people are thinking about this, but it is not being made very practical. Once we have the data collection ready, we can work on that aspect as well, and then we can really assess whether the other measures that we are taking at the end of the project are effective in relation to workloads.
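To show why falling energy use can push PUE up, here is a back-of-the-envelope calculation with invented numbers: if workload shifting lets the IT equipment do the same work with less electricity while the facility overhead (cooling, distribution losses) stays roughly constant, total energy drops but PUE, defined as total facility energy divided by IT energy, rises.

```python
def pue(total_kwh: float, it_kwh: float) -> float:
    """Power Usage Effectiveness: total facility energy over IT equipment energy."""
    return total_kwh / it_kwh

overhead_kwh = 30.0  # cooling and other overhead, assumed roughly constant

it_before = 100.0    # IT energy before workload optimization (illustrative)
it_after = 80.0      # same work done with less IT energy after shifting

print(pue(it_before + overhead_kwh, it_before))  # 1.30
print(pue(it_after + overhead_kwh, it_after))    # 1.375: higher PUE...
print(it_before + overhead_kwh, it_after + overhead_kwh)  # ...yet 130 vs 110 kWh total
```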
Mo: So this is a really fantastic point. I think you touched on two points here. You've touched on what the unit of work is that we can make more efficient. I know you've written some of the deliverables, which I think will be released, about what we could call the ‘van Vliet metric’, which is, I think, billions of operations a second. I'll let you talk about that later if you want. But the second element is what you mentioned about PUE. There are a number of KPIs in this project that we are trying to improve: energy reuse factors, renewable energy factors, PUE, and so on. But it may well be that we develop new or better KPIs as a result of understanding that our existing KPI suite is not the most optimal. Exactly as you said, optimizing to a certain metric may skew what is truly efficient or truly optimal. Maybe at the end of this project, in two years' time, we will have two years' worth of data where we can say: actually, there's evidence that we should be using something else, or something in addition to PUE.
Jurg: Yeah, exactly. It's interesting to illustrate this. Suppose you have two different servers: one is very fast and the other one is very cheap. They have different performance characteristics.
Mo: Every workload is different, right? So it's like comparing a sprinter to a long-distance runner. One is optimized for speed, one is optimized for distance, and there's no way to compare or combine them.
Jurg: Yeah. Not yet. So we are looking at ways to compare these two different things. You can do a lot of work on hardware from three years ago, which might be energy efficient in itself, but because it's already there the overall impact is higher. These are the things that you want to capture, because that's the true gain you want to achieve, at least from the perspective of a full-stack approach, to use software engineering terms.
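A crude way to put two such servers on one scale is to look at useful work delivered per unit of energy, for instance operations per joule, as in the sketch below. The figures are purely illustrative and are not measurements from the project, and embodied energy from manufacturing is left out, even though Jurg's point is precisely that it should eventually be part of the picture.

```python
def ops_per_joule(ops_per_second: float, power_watts: float) -> float:
    """Useful work per unit of energy: operations delivered per joule consumed."""
    return ops_per_second / power_watts  # 1 watt = 1 joule per second

fast_new_server = ops_per_joule(ops_per_second=5.0e9, power_watts=400.0)
cheap_old_server = ops_per_joule(ops_per_second=1.5e9, power_watts=200.0)

print(f"new server: {fast_new_server:.2e} ops/J")   # 1.25e+07 ops/J
print(f"old server: {cheap_old_server:.2e} ops/J")  # 7.50e+06 ops/J
```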
Mo: So we'll bring the interview to a close now. Thank you for your contribution, and in summary, what this project is doing is trying to bring hyperscale levels of control and efficiency to small, distributed data centers that have not traditionally been able to exploit such efficiencies for various reasons like ownership issues, integration issues, complexity, etc. And Jurg today talked a lot about the technical milestone that's coming up, which is actually connecting the data centers to the assessment platform and the technology around that. So thank you, Jurg, for your input. I give you the last word in three words: How would you describe this project? Or maybe just one word? I'll put you on the spot here.
Jurg: Smart energy management system. That's four words. So the one word I got, I added to the three, and then you have smart energy management system.
Mo: Smart energy management system. Perfect. Thank you. Thanks for your contribution today. And we look forward to our next interview with Ender, where we'll talk about the CFD element of the project. So thank you.