Holograms and Data Science

This will be a quick and different post, recording one of those A-HA moments we sometimes have, before I forget it. It is a random thought that needs more than 140 characters, so a tweet would not do.

I just finished reading The Black Hole War, by Leonard Susskind (one of the most important physicists alive today). It is a great book if you are interested in modern physics. Here is an excerpt of my review on Goodreads:

(…). For non-physicists like me, this was a fantastic introduction to what we currently know about quantum gravity and its relation to other areas of science. As a bonus, it also (finally) helped me start grasping string theory, and better understand entropy, the event horizon (complementarity, information paradox) and the holographic principle.

My A-HA moment came while the book discussed the Holographic Principle. Coincidence or not, I have been studying a whole bunch of Machine Learning, and things converged. I love when (seemingly) completely different areas of science suddenly converge. It reinforces my belief that there must be something common underlying all of it (call it Grand Unified Theory, or God, or as you please).

Principal Component Analysis and Dimensionality Reduction are the Data Science (Machine Learning) topics that, in particular, seem to be correlated with the Holographic Principle. They are techniques for removing redundancy in data sets and extracting the minimal data necessary to represent something. They allow you, for example, to reduce the number of columns in a database without losing meaningful information. The main technique can be seen as a linear algebra transformation: a projection of the N-dimensional data onto a smaller K-dimensional space.
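For the curious, here is a minimal sketch of that projection using plain numpy (PCA via the SVD; the data and the choice of dimensions are made up, just for illustration):

import numpy as np

# made-up data set: 1000 samples, N = 10 columns (dimensions)
X = np.random.randn(1000, 10)

# center the data, then compute the principal directions via the SVD
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# keep the top K = 3 components and project the data onto them
K = 3
X_reduced = Xc @ Vt[:K].T   # shape: (1000, 3)

# fraction of the total variance retained by the K components
explained = (S[:K] ** 2).sum() / (S ** 2).sum()
print(f"kept {explained:.1%} of the variance in {K} of 10 dimensions")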

In a sense, holograms are the same thing: a projection (encoding) of a higher-dimensional space onto a lower-dimensional one (3D to 2D, for example). The Holographic Principle states that all information inside an N-dimensional space is contained in its (N-1)-dimensional boundary. For example, all the information inside a volume (3D) is contained in (or described by) its surface/area (2D).

This is fascinating. Holograms are fascinating. It could mean that not all dimensions (e.g.: rows/columns in your database) are necessary to perfectly describe a piece of information, and removing that redundancy is precisely what Dimensionality Reduction (and PCA) tries to do. I wonder if we can use the Holographic Principle to find ways to do Dimensionality Reduction without losing any information (a lossless compression, if you will).

Linear algebra is another fascinating aspect of all this. Projecting data and extracting meaningful information (compression) always seems to involve the calculation of Eigenvectors and Eigenvalues. Google is constantly calculating Eigenvectors that power your searches. Somehow, Eigenvectors also seem to be central to all of this.
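As a toy illustration of what that eigenvector computation looks like, here is a minimal power-iteration sketch (the matrix is made up; PageRank-style ranking iterates something very similar on a huge link matrix):

import numpy as np

# made-up column-stochastic "link" matrix: A[i, j] is the chance of going from j to i
A = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])

# power iteration: repeatedly apply A until the vector stops changing;
# the result is the dominant eigenvector (the "ranking")
v = np.ones(A.shape[0]) / A.shape[0]
for _ in range(100):
    v = A @ v
    v /= v.sum()

print(v)  # the dominant eigenvector, i.e. the stationary ranking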

I am sure this is not something I’m inventing or discovering, and there are plenty of papers about it out there. I only happen to not have stumbled upon any of them. A quick search tells me that Holography and Dimensionality Reduction are being used in many different areas of science, including genetics and biology. If you know of any such papers (proving or disproving my random thought), let me know in the comments.


My DockerCon 2014 talk: Thoughts on interoperable containers

Recently, I finally had a chance to read Presentation Zen: Simple Ideas on Presentation Design and Delivery, by Garr Reynolds. I wish I had done it before, after so many years of giving less-than-ideal talks and ignoring recommendations from presenters I admire and whose presentations I highly respect.

It is usual to post your slides somewhere public right after you give a presentation. I’m sure that Docker organizers are going to publish my slides somewhere, but here they are if you really want to see them and don’t want to wait. However, if you agree with some of the guidelines from Presentation Zen, slides from a good presentation will not have a lot of useful information. Slides are there just as a visual aid to the story being told during a presentation. They should be highly visual and help illustrate your points. A slide deck is not a document.

I want to try something different this time. This blog post is an attempt at a proper handout for my slides. I hope it is a bit more useful than a bunch of slides with beautiful pictures and short sentences to all of you who could not attend DockerCon and didn’t see my talk. Or for those of you who attended (thank you!) and want to review what I said.


We want to run our apps, unmodified, everywhere. I too have been trying to find ways to make it happen, and I want to share some of the things I have discovered so far.


I run Linux Containers at Heroku. Lots of them. Heroku has been running Linux Containers (dynos) for more than 3 years. Maybe 4 years, I don’t know exactly when the move from chroot jails was made.

Docker has so many different uses! Every other week I discover people using Docker in a different way. For example, some people are now distributing CLI tools as Docker containers. Here is (more or less) how Ruby VMs are being built these days at Heroku (thanks to Terence and ENTRYPOINT):

docker run hone/ruby-builder --version 2.1.0p123

# a new ruby VM is now available in a local directory

Compare that with a ruby-builder binary that would do the same thing: the containerized version includes all the dependencies and works reliably everywhere Docker is available.

We are interested in one particular use of Docker, though: running portable, server-side web applications that follow the 12factor.net guidelines.

“Write once, run everywhere” is the dream. In reality, each group (app developers, PaaS providers, Docker developers, LXC developers, Linux Kernel developers, etc.) has different priorities. Different trade-offs and optimizations are being made by each one of them.

  • Developers want apps: “hey provider, just run my damn container!”.
  • PaaS providers want scale: they must scale as a business, with sustainable operations, while being fast and secure. Arbitrary Code Execution as a Service.
  • Docker wants to be the right tool for many different use cases. It is a toolkit that will enable many different things, but we can’t expect Docker developers to solve all the problems. They are too busy already!


I have been navigating this problem space for a while now, playing all the different roles. In the end, besides building a PaaS at Heroku, I am also an app developer, and I enjoy hacking on open source container managers (Docker, LXC, the Linux kernel).


Of all the slides with beautiful pictures, this is the only one I can truly say I made myself. :-)

I have been trying to reconcile all sides and would like to share some things I learned. Ultimately, I want to hear from you: tell me what is bullshit and which assumptions I’m getting wrong, and/or whether you agree. You can also contribute and help make this happen with ideas, design and code.


The container shipping analogy is the first thing that comes to mind. You build a full container locally then ship it to Platform and Infrastructure providers. A Docker container would run anywhere.

trying to make Docker secure for multi-tenant scenarios is a can of worms

– darren0, at #docker-dev

I agree with darren0. Most of the challenges I have faced so far, trying to make this work as a Platform provider, come from the differences between local or small Docker environments (one or a few apps on a few servers) and multitenant environments where boxes are packed with containers from many different apps (tenants).


At Heroku, we run millions of containers (apps). It is not hard to imagine why the trade-offs need to be different from those of a smaller environment.

One of these challenges is root access. When you build a container locally or in your build servers, you have root access inside it.


Root access is controversial. For some people it is obvious why containers can’t have it in a shared (multitenant) environment, or any production environment for that matter. Others have a hard time accepting the idea that it is dangerous, and believe that containers should be safe enough to cover all issues. Or they don’t want to care, and just want the runtime environment to solve all the problems.

App developers often want root access because:

  • apt-get install ..., they want to install packages, tools and libraries.
  • vi /etc/..., they want to edit important configuration files.
  • mount -t fancy ..., they want to mount filesystems and use what the kernel has to offer.
  • modprobe something, they want to load modules and use what the kernel has to offer #2.
  • iptables -A INPUT ..., they want to configure firewall rules, port mappings, …


The big problem is that root access in a container is also root access on the host system, the same host machine that runs many other containers for different apps/tenants. If anyone can – for any reason – escape a container, they would be able to do anything they wanted with the other containers on the same box: look at other containers’ data and code, and potentially escalate to other components of the infrastructure.

Escaping containers used to be much easier. To be honest, things are much better nowadays, and I will admit that there is a lot of FUD. Jérôme Petazzoni gave a nice talk about it earlier this year, go check it out:

LXC, Docker, Security

Still, you always need to be careful and do the right thing when running containers. Thankfully, Docker, LXC and Heroku (the container managers I’m familiar with) are in general doing a good job IMO: dropping capabilities, using AppArmor/SELinux/grsecurity, kernel namespaces, cgroup controllers, etc. But anyone can still do bad things on top of them and leak resources into containers. Just as an example, I’ve seen people mention that they bind mount the Docker socket (/var/run/docker.sock) inside containers, which basically gives those containers full privileges on the host.

But, escaping containers is not the only problem with root access. In the Linux Kernel, there are still many parts that haven’t been containerized. Some places in the kernel will hold global locks, or expose global resources. Some examples are the /sys and /proc filesystems. Docker does a great job preventing known problems, but if you are not careful:

# don't do this in a machine you care about, the host will halt
docker run --privileged ubuntu sh -c "echo b > /proc/sysrq-trigger"

If you are not careful protecting /sys and /proc (and again, Docker by default does the right thing AFAIK), any container can bring a whole host down, affecting the other tenants running on the same box. Even when container managers do everything they can to protect the host, it might not be enough. Root users inside containers can still call syscalls or operations usually available only to root that will cause the kernel to hold global locks and impact the host. I wouldn’t be surprised if we keep finding more such parts of the kernel.

This is also one of the reasons many people are choosing to run one container per box, physical or virtual (the other reason being better resource isolation). One container per VM is a nice way to leverage the advantages of both containers and hypervisors.

I am happy that this is getting better and better with time, and most of these concerns are not really a big deal anymore, as long as you use bleeding edge versions of Docker, the Linux Kernel, AppArmor/SELinux, etc. I can see a light at the end of the tunnel that would allow Providers to permit root access inside containers. Some Providers have even already started running code as root inside containers. However, it increases the attack surface considerably, and we can’t blame providers for trying to prevent malicious (or even careless) tenants from DoS’ing their boxes or getting access to other containers (apps).


User namespaces (a.k.a. unprivileged containers) were recently added to the Linux Kernel to provide safe root access inside containers. They allow regular (unprivileged) user ids on a host machine to be mapped to arbitrary user ids inside a new namespace.

That way, root users (uid=0) inside a container (namespace) can be mapped to regular unprivileged users outside of the container. Root users inside user namespaces are not necessarily root on the whole box. It is a better implementation of fakeroot enforced and controlled by the kernel. Nice!
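As a rough illustration of the raw kernel mechanism (not how Docker or LXC implement it, and assuming a kernel that allows unprivileged user namespaces), a regular process can create a user namespace and map uid 0 inside it to its own unprivileged uid outside:

import ctypes, os

CLONE_NEWUSER = 0x10000000
libc = ctypes.CDLL(None, use_errno=True)

outer_uid = os.getuid()          # a regular, unprivileged uid on the host
if libc.unshare(CLONE_NEWUSER) != 0:
    raise OSError(ctypes.get_errno(), "unshare(CLONE_NEWUSER) failed")

# map uid 0 inside the new namespace to our unprivileged uid outside of it
with open("/proc/self/uid_map", "w") as f:
    f.write(f"0 {outer_uid} 1\n")

print(os.getuid())  # prints 0: "root" inside the namespace, still unprivileged on the host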


This is very recent though, and will probably take some time to stabilize. It is not hard to find exploits from earlier this year (a few months ago).

Because of the way it works, it can also be dangerous by design: by default, root users inside new user namespaces have all the kernel capabilities (superpowers) inside that namespace (container).

Providers relying on user namespaces to provide (fake)root access will still need to drop capabilities; otherwise, (fake)root inside containers will still be able to call syscalls and operations on the kernel that only root users can call. With that, they can potentially DoS boxes or escape containers. Another simple example is mknod and mount: if a container is not protected appropriately, even with user namespaces, a (fake)root user could just get raw access to the local disks in a box, mount them, and start reading all unencrypted data on the box, including data from other containers. Again, Docker does the right thing AFAIK, as long as containers don’t run as --privileged. But Docker doesn’t support user namespaces yet, and I’m sure Docker developers are thinking it through rather than blindly adding support.

Another attack vector is code like this:

if (getuid() == 0) {
  // do root stuff
}

Who has not done this before? Blame me, I certainly have. This is how authorization was intended to be done on Linux systems before capabilities and user namespaces were added. Nowadays, in order to be safe, that code would need to check capabilities in a specific user namespace.

How much code like this is out there? Who knows… but if you are uid 0 in a namespace, you will be able to pass all these validations.
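A more robust check looks at the effective capability set instead of the uid. A rough sketch of what that looks like, reading CapEff from /proc/self/status (CAP_SYS_ADMIN, bit 21, is used here just as an example):

def has_capability(bit):
    # CapEff in /proc/self/status is a hex bitmask of the effective capabilities
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("CapEff:"):
                cap_eff = int(line.split()[1], 16)
                return bool(cap_eff & (1 << bit))
    return False

CAP_SYS_ADMIN = 21
if has_capability(CAP_SYS_ADMIN):
    # do "root stuff" -- even if getuid() == 0, a container that dropped
    # this capability will never get here
    pass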

When a root user has all its capabilities dropped in a new user namespace, it is very similar to a regular non-privileged user: most (if not all) of the things a user would want root access for will not work anyway. It might as well be a regular, non-privileged user.

Just don’t run as root?

As an app developer, why should I care? Providers can then just run app code as unprivileged users, end of story!

Unfortunately not: setuid binaries are a problem in this scenario. Binaries with that permission (bit) set are executed as the owner of the file, not with the permissions of the user that runs the binary.

Injecting arbitrary code as setuid binaries owned by root is a very well known, old attack to execute code as root on UNIX systems, and container images built locally can contain anything. If PaaS providers accept arbitrary images to run, they must be very careful to remove all setuid binaries owned by root, or disable setuid completely on all filesystems (with the nosuid flag and AppArmor/SELinux, for example).
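As an illustration, a scan for such binaries in an image could be as simple as something like this (just a sketch, not what any particular provider runs; the rootfs path is hypothetical):

import os, stat

def find_setuid_root(rootfs):
    """Report regular files owned by root with the setuid bit set."""
    hits = []
    for dirpath, _dirs, files in os.walk(rootfs):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue
            if stat.S_ISREG(st.st_mode) and (st.st_mode & stat.S_ISUID) and st.st_uid == 0:
                hits.append(path)
    return hits

# e.g. point it at an unpacked container image before accepting it
for path in find_setuid_root("/var/lib/myprovider/images/untrusted-rootfs"):
    print("setuid root binary:", path)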

Providers also need to be extra careful with what gets injected into containers, e.g.: bind mounts with --volume or --volumes-from. A filesystem without nosuid leaking into a container allows malicious users to create and execute arbitrary setuid binaries.

These binaries are sometimes useful and some apps may require them. setuid is what allows unprivileged users to execute useful things traditionally available only to superusers, like ping (requires raw socket access), tcpdump (requires promiscuous mode), etc. These will probably not be available in many multitenant container Platforms.


This all makes me think that we may need a way to specify the constraints that particular runtime environments impose on container images. Different Providers will make different choices about what they accept or not, and it may be useful to capture these requirements, or container capabilities, as container metadata.

We would need something like a Restrictions API for containers, which could be validated by Providers at runtime, by container managers during builds, by registries during docker push, … A rough sketch follows the examples below.

Here are some other examples of requirements that may need to be imposed on containers by some runtime environments:

  • Networking: the number of network interfaces available, which ports are reachable/routable from the outside, how many IP addresses are available, public vs. private IPs, firewall rules, etc.
  • Ephemeral disks: some Providers may not provide persistent disks for containers.
  • Arch, OSes: which architectures are supported (e.g.: x86_64, arm).
  • Image size: there probably will be a max amount of disk a container can occupy, or a max size for the container image to be downloaded.
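To make the idea more concrete, here is a minimal sketch of what such restrictions metadata and a check could look like (all names and limits are hypothetical, not a proposal for an actual format):

# hypothetical restrictions a Provider could publish for its runtime environment
RESTRICTIONS = {
    "allow_root": False,
    "allow_setuid": False,
    "persistent_disk": False,
    "max_image_size_bytes": 2 * 1024 ** 3,   # 2GB
    "architectures": ["x86_64"],
}

def check_image(image_metadata, restrictions=RESTRICTIONS):
    """Validate (hypothetical) image metadata against a runtime's restrictions."""
    errors = []
    if image_metadata.get("needs_root") and not restrictions["allow_root"]:
        errors.append("image requires root, runtime does not allow it")
    if image_metadata.get("size_bytes", 0) > restrictions["max_image_size_bytes"]:
        errors.append("image is larger than the runtime allows")
    if image_metadata.get("arch") not in restrictions["architectures"]:
        errors.append("architecture not supported by this runtime")
    return errors

# a registry could run this on docker push, a Provider at launch time
print(check_image({"needs_root": True, "size_bytes": 3 * 1024 ** 3, "arch": "arm"}))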

container-rfc is an attempt to define a standard container image format and metadata that would work on multiple container managers.


Container images, in my experience, are 2-5GB in size.

Heroku is very dynamic. Containers are constantly being cycled and are always moving; thousands of them are launched every minute. Having to download a 2-5GB image every time a container is launched does not sound reasonable.

Docker solves this problem with layers.


Deltas are overlaid on top of a base image. All containers are formed from a hierarchy of read-only layers, and a private writeable layer on top of them. Read-only layers are shared between all containers that use them on a host, and are cached locally.

When many containers use the same base image and share some layers, those are downloaded only once. If providers can restrict the number of base images they support (Restrictions API, again?), known base images can be pre-cached on runtime servers and containers can be launched very quickly, as only the layers (deltas) specific to each app need to be downloaded.

Even so, Providers will probably need a way to limit the max size of a layer, or a set of layers.


This works well, until base images (or shared layers) need to be updated. We do that constantly at Heroku, for example with security updates. Every time a base image or shared layer gets updated, all containers using it need to be rebuilt to pick up the changes.

Heroku wouldn’t be able to quickly respond to security incidents like Heartbleed if we had to rebuild millions of containers every time the base image needs to be updated.

Compare that with what we currently do at Heroku.


It is a more traditional approach, similar to how packages are traditionally installed on Operating Systems. Apps are just unpacked on top of a read-only base image, inside containers. The read-only base image is shared (bind mounted) between all containers and downloaded only once on each box.

When containers are being launched, only the app package (a.k.a. slug) is downloaded.

When the base image needs to be updated, a new version is sent to all runtime instances, and new containers are simply launched on top of a new read-only base image. No rebuild operations are required on any apps.

It is also possible to do that with Docker, but it does not follow the traditional model of shipping containers (docker push + docker pull). The community may come up with good ways to solve this problem in the future, see dotcloud/docker#332 for example.

# idea: make a container point to a new base image?
docker save myapp | docker load --rebase=new-base-image


Honestly, the whole idea of supporting an arbitrary image format everywhere reminds me a bit of what happened with VM image formats a while ago. Many people were trying to find a common format for VMs that would work on any hypervisor. They failed. Do you remember VMDK vs VHD vs QCow vs QCow2 vs …?

I can’t say yet if the idea of shipping whole containers everywhere is a rabbit hole. Maybe it is, maybe it isn’t. I’ve been experimenting with it and others should do as well.

But let’s take a step back. How about, instead of putting restrictions on whole container images and distributing them everywhere, we turn portable apps into containers? We shift the focus back to the app code being distributed and make sure that apps are portable and can run anywhere, then we map apps to execution runtimes, like Docker containers.

Runtime environments don’t need to be just containers. Portable apps (12factor.net) can then be mapped to raw VMs, or even bare metal servers.

What we are missing in this context is something standard to prepare these portable apps for different runtime environments.


Buildpacks are a possible way to make this happen. Initially, buildpacks were created to compile (build) 12factor apps into a package that can be executed inside Heroku containers (dynos). But buildpacks are very simple and flexible: just a few executables (bin/detect, bin/compile and bin/release) that are used to build an app for a runtime environment.

Many different PaaS providers adopted buildpacks as their mechanisms to build app code. Each language runtime (Ruby, Python, Node.js, etc.) has a different buildpack.

I propose that we extend the concept of a buildpack: given a base image, it is something that maps (transforms) apps into runtime environments (e.g.: a Docker container) during builds.

One way to implement this idea for Docker is to make each buildpack a container image that can be used as a parent image by 12factor apps:

$ cat my-portable-app/Dockerfile
FROM heroku/heroku-buildpack-ruby

Buildpack images can then use ONBUILD triggers to do everything required to turn an app into an executable Docker container:

$ cat heroku/heroku-buildpack-ruby/Dockerfile
# Heroku's base image based on Ubuntu 10.04
FROM heroku/cedar

ADD . /buildpack
ONBUILD ADD . /app
ONBUILD RUN /buildpack/bin/compile /app
ONBUILD ENV PORT 5000
ONBUILD EXPOSE 5000

I am experimenting heavily with this and I’ve already learned that ONBUILD has limitations (dotcloud/docker#5714). I hope to share more soon.

I am also sure there are other possible ways to use buildpacks to turn apps into executable docker containers. This is just an example (not even complete). Others are doing similar things:

  • Buildstep, from Flynn/Dokku.
  • Google *-runtime images act as buildpack images too, though they don’t use heroku buildpacks – yet :-).
  • Radial tries to map 12factor guidelines to Docker.

I’m not trying to undervalue containers (and Docker!). They are still an amazing way of running apps. But once we shift the focus from “shipping containers” to “shipping apps” again, we open possibilities to run our apps in more runtime environments. As an example, some of my apps have a Makefile to build them locally using a buildpack, so that I can run them locally, on my dev machine (no VMs, no containers, just plain old process execution):

#!/usr/bin/env make -f

buildpath := .build
buildpackpath := $(buildpath)/pack
buildpackcache := $(buildpath)/cache

build: $(buildpackpath)/bin
    $(buildpackpath)/bin/compile . $(buildpackcache)

$(buildpackcache):
    mkdir -p $(buildpath)
    mkdir -p $(buildpackcache)
    curl -O https://codon-buildpacks.s3.amazonaws.com/buildpacks/kr/go.tgz
    mv go.tgz $(buildpath)

$(buildpackpath)/bin: $(buildpackcache)
    mkdir -p $(buildpackpath)
    tar -C $(buildpackpath) -zxf $(buildpath)/go.tgz

make is not the most straightforward thing, but this should be simple enough. make build will build my code using a buildpack. In this case, the code is written in Go, but it could be any language/framework that has a buildpack. The Makefile just downloads a buildpack and calls bin/compile, the standard buildpack interface to build apps.

This makefile can be seen as a buildpack that maps an app (source code) into a runtime environment (single binary), given a base image (the OS on my laptop).

Here is another example we use at Heroku to run apps in raw machines or VMs (hey, sometimes we need to do it too!):

ruby = "https://codon-buildpacks.s3.amazonaws.com/buildpacks/heroku/ruby.tgz"

app_container "myapp" do
  buildpack ruby
  git_url "git@mycompany.com:myapp.git"
end

define :app_container,
       name: nil,
       buildpack: nil,
       git_url: nil do
  # ...

  execute "#{name} buildpack compile" do
    command "#{dir}/.build/pack/bin/compile #{dir} .build/cache"
  end
end

It is a Chef recipe that builds an app inside a machine using a buildpack, by calling bin/compile. Again!

If an app can be built with a buildpack, then it can potentially run on multiple runtime environments, including Docker containers, Heroku, Cloud Foundry, your own machines via Chef recipes, etc.


My talk was not intended to provide any final answers. In my personal journey, I’ve been bouncing between these two concepts, container centric and app centric, and so far I have found that both have their pros and cons. I’m leaning more towards the app centric model, but I’m biased, working as a PaaS provider. Ultimately, I hope we can find something most people are happy with and that everyone can use to run their apps (almost) anywhere.



Memory inside Linux containers

Or why don’t free and top work in a Linux container?

Lately at Heroku, we have been trying to find the best way to expose memory usage and limits inside Linux containers. It would be easy to do it in a vendor-specific way: most container-specific metrics are available in the cgroup filesystem via /path/to/cgroup/memory.stat, /path/to/cgroup/memory.usage_in_bytes, /path/to/cgroup/memory.limit_in_bytes and others.

An implementation of Linux containers could easily inject one or more of those files inside containers. Here is a hypothetical example of what Heroku, Docker and others could do:

# create a new dyno (container):
$ heroku run bash

# then, inside the dyno:
(dyno) $ cat /sys/fs/cgroup/memory/memory.stat
cache 15582273536
rss 2308546560
mapped_file 275681280
swap 94928896
pgpgin 30203686979
pgpgout 30199319103
# ...

/sys/fs/cgroup/ is the recommended location for cgroup hierarchies, but it is not a standard. If a tool or library wants to read from it and be portable across multiple container implementations, it would need to discover the location first, by parsing /proc/self/cgroup and /proc/self/mountinfo. Further, /sys/fs/cgroup is just an umbrella for all cgroup hierarchies; there is no recommendation or standard for the location of my own cgroup. Thinking about it, /sys/fs/cgroup/self would not be a bad idea.
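Here is a rough sketch of that discovery dance (assuming cgroup v1 and that the memory hierarchy is mounted somewhere visible; error handling omitted):

def memory_cgroup_path():
    # /proc/self/cgroup lines look like "4:memory:/lxc/mycontainer"
    cgroup = None
    with open("/proc/self/cgroup") as f:
        for line in f:
            _hid, controllers, path = line.rstrip("\n").split(":", 2)
            if "memory" in controllers.split(","):
                cgroup = path
                break

    # /proc/self/mountinfo tells us where the memory hierarchy is mounted
    with open("/proc/self/mountinfo") as f:
        for line in f:
            fields = line.split()
            sep = fields.index("-")
            fstype, super_opts = fields[sep + 1], fields[sep + 3]
            if fstype == "cgroup" and "memory" in super_opts.split(","):
                return fields[4] + cgroup   # mountpoint + our cgroup path
    return None

path = memory_cgroup_path()
with open(path + "/memory.limit_in_bytes") as f:
    print("memory limit:", f.read().strip())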

If we decide to go down that path, I would personally prefer to work with the rest of the Linux containers community first and come up with a standard.

I wish it were that simple.

The problem

Most of the Linux tools providing system resource metrics were created before cgroups even existed (e.g.: free and top, both from procps). They usually read memory metrics from the proc filesystem: /proc/meminfo, /proc/vmstat, /proc/PID/smaps