There is no shortage of options when it comes to real-time monitoring applications. Several of them enable preventive as well as corrective action by notifying the administration team well in advance. However, as a metric collector and a metric store, Telegraf and AWS CloudWatch, respectively, offer much more than this: they are easy to set up and can be deployed on-premise or in the cloud.
AWS CloudWatch is a tool from Amazon that helps system architects, developers, and administrators perform real-time monitoring of their AWS applications hosted in the cloud. …
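As a minimal sketch of this pairing, Telegraf can collect basic host metrics and ship them to CloudWatch via its built-in `outputs.cloudwatch` plugin. The region and namespace below are placeholders, and the plugin needs AWS credentials configured on the host (e.g. via the instance role or environment variables):

```toml
# Telegraf agent: collect every 60 seconds.
[agent]
  interval = "60s"

# Inputs: aggregate CPU usage and memory usage of the host.
[[inputs.cpu]]
  percpu = false
  totalcpu = true

[[inputs.mem]]

# Output: publish the collected metrics to AWS CloudWatch.
# "MyApp/Telegraf" is an example namespace, not a required value.
[[outputs.cloudwatch]]
  region = "us-east-1"
  namespace = "MyApp/Telegraf"
```

With this config in place, `telegraf --config telegraf.conf` starts pushing the metrics, which then show up under the chosen custom namespace in the CloudWatch console.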
Containerization technology has greatly simplified GPU computing for machine learning by wrapping the software stack above the kernel level in containers, letting you juggle different combinations of frameworks, low-level libraries, and hardware drivers. Technologies like nvidia-docker even unlocked stack combinations (driver plus CUDA toolkit) that were not feasible before. This, however, adds another dimension to the performance optimization problem: now you not only need to choose the optimal hardware for your machine learning task but also vary the driver/CUDA combination to fine-tune performance further.
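To make the "driver plus CUDA toolkit" point concrete, here is a sketch of running two containers with different CUDA toolkit versions against the same host driver. The image tags are illustrative examples, and the commands assume Docker with the NVIDIA container runtime installed:

```shell
# One host driver, two containerized CUDA toolkits.
# Each container reports the GPU via nvidia-smi using its own CUDA userspace.
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu20.04 nvidia-smi
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu20.04 nvidia-smi
```

Because only the kernel-level driver lives on the host, benchmarking a workload across CUDA versions becomes a matter of swapping the image tag rather than reinstalling the toolkit.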
The goal of this short article is to illustrate that such optimization is…
There are a lot of guides on installing Nvidia drivers in the wild, but when I needed to set up a system where I could easily switch between driver versions for our research, I still spent several days trying to figure out the most seamless way to do it.
Here I want to summarize those days of googling, trials, and errors, and the final solution I came up with, which turned out to be far simpler than I expected.
I am using an Ubuntu distro as a reference; the process should be the same for any version above 16.04. For…
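For orientation, the standard Ubuntu route looks roughly like this; the driver version numbers are examples only, and switching versions generally means removing the current driver packages before installing another:

```shell
# List the driver packages Ubuntu recommends for the detected GPU.
ubuntu-drivers devices

# Install a specific driver version from the standard repository
# (525 is an example version, not a recommendation).
sudo apt install nvidia-driver-525

# To switch versions, purge the installed driver packages first,
# then install the other version and reboot.
sudo apt purge 'nvidia-*'
sudo apt install nvidia-driver-470
sudo reboot
```

Done this way, every switch is a purge-install-reboot cycle, which is exactly the friction the rest of this article tries to remove.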
Proper monitoring is a foundation for mindful configuration management. It’s a prerequisite for any optimization effort. Yet monitoring is useless without the ability to act on the insights it provides.
What most monitoring solutions lack is easy access to instant, relevant action on the insights you gather. This post is about a bot-based notification and management layer that we created to explore ways of dealing with that exact problem; for now, it is available in beta mode to selected customers.
We in the RC team resolved to provide three simple things: an ability to come up…
We at RocketCompute regularly optimize cloud HPC configurations for different workloads (e.g., this case on weather prediction). It is quite common for us to see price or performance vary by as much as 4x–5x across different HPC cluster configurations. This variation usually depends heavily on network latency within the cluster and on the unique requirements of the workload (algorithm, software stack, RAM and storage requirements, etc.).
When we started looking at deep learning workloads, we discovered that most of the time our colleagues use one of the following rules of thumb when picking cloud configurations for training DL models: