There's lots of innovation out there building better machine learning models with new neural net structures, regularization methods, etc. Groups like fast.ai are training complex models quickly on commodity hardware by relying more on "algorithmic creativity" than on overwhelming hardware power, which is good news for those of us without data centers full of hardware. Rather than add to your stash of creative algorithms, this post takes a different (but compatible) approach to better performance: we'll step through measuring and optimizing model training from a systems perspective, which applies regardless of what algorithm you're using. As we'll see, there are nice speedups to be had with little effort. While keeping the same model structure, hyperparameters, etc., training speed improves by 26% simply by moving some work to another CPU core. You might find similar easy performance gains waiting for you whether your project uses an SVM or a neural network.
We'll be using the fast neural style example in PyTorch's example projects as the system to optimize. If you just want to see the few code changes needed, check out this branch. Otherwise, to see how to get there step-by-step so that you can replicate this process on your own projects, read on.
If you want to follow along, start by downloading the 2017 COCO training dataset (18GiB). You'll also need a Linux system with a recent kernel and a GPU (an NVIDIA one, if you want to use the provided commands as-is). My hardware for this experiment is an i7-6850K with 2x GTX 1070 Ti, though we'll only be using one GPU this time.
If you're a virtualenv user, you'll probably want a virtualenv with the necessary packages (at minimum, PyTorch and Pillow) installed.
Clone the pytorch/examples repo and go into the
fast_neural_style directory, then start training a model. The batch size is left at the default (4) so it will be easier to replicate these results on smaller hardware, but of course feel free to increase the batch size if you have the hardware. We run
date first so we have a timestamp to compare later timestamps with, and pass
--cuda 1 so that it will use a GPU. The directory passed to
--dataset should be the one containing the
train2017 directory, not the path to train2017 itself.
While that's running for the next few hours, let's dig in to its performance characteristics.
vmstat 2 isn't fancy, but it's a good place to start. Unsurprisingly, we have negligible
wa (i/o wait) CPU usage and little block i/o, and we're seeing normal user CPU usage for one busy process on a 6-core, 12-thread system (10-12%).
Moving on to the GPU, we'll use
nvidia-smi dmon -s u -i 0.
dmon periodically outputs GPU info, and we're limiting it to utilization (
-s u) and we want the first GPU device (-i 0).
Now this is more interesting. GPU utilization as low as 30%? This workload should basically be waiting on the GPU the entire time, so failing to keep the GPU busy is a problem.
To see if there's something seriously wrong,
perf stat is a simple way to get a high-level view of what's going on. Just using 100% of a CPU core doesn't mean much; it could be spending all of its time waiting for memory access or pipeline flushes. We can attach to a running process (that's the
-p <pid>) and aggregate performance counters. Note that if you're using a virtual machine, you may not have access to performance counters, and even my physical hardware doesn't support all the counters
perf stat looks for.
After letting that run for a couple of minutes, stopping it with
^C prints a summary.
- 1.25 instructions per cycle isn't awful, but it's not great either. In ideal circumstances, CPUs can retire several instructions per cycle, so if this were more like 3 or 4, that would be a sign that the CPU was already being kept quite busy.
- 0.7% branch mispredicts is higher than I'd like, but it isn't catastrophic.
- Hundreds of context switches per second is not suitable for realtime systems, but is unlikely to affect batch workloads like this one.
Overall, there's nothing obvious like a 5% branch miss rate or 0.5 IPC, so this isn't telling us anything particularly compelling.
Capturing performance counter information with
perf stat shows that there's some room to improve, but it's not providing any details on what to do about it.
perf record, on the other hand, samples the program's stack while it's running, so it will tell us more about what specifically the CPU is spending its time doing.
After letting that run for a few minutes, that will have written a
perf.data file. It's not human readable, but
perf annotate will output disassembly with the percentage of samples where the CPU was executing that instruction. For my run, the output starts with a disassembly of
syscall_return_via_sysret indicating that most of the time there is spent on
pop. That's not particularly useful knowledge at the moment (the process makes a good number of syscalls, so we do expect to see some time spent there), so let's keep looking. The next item is for
jpeg_idct_islow@@LIBJPEG_9.0, part of PIL (Python Imaging Library, aka Pillow). The output starts with a summary that runs for about 50 lines, telling us that rather than having a few hot instructions in this function, the cost is smeared out across many instructions (2.52% at offset 36635, 1.93% at 3624d, etc.). Paging down to the disassembly confirms it: lots of arithmetic instructions, each accounting for a small slice of the samples.
In perf annotate's disassembly view, the percentage of samples appears in the left column (sometimes prefixed with the symbol name for particularly busy samples), next to each instruction's offset. In my run, 0.72% of cycles were spent on a
shlq (shift left), 0.82% on an
addq (integer addition), and so on. Note that due to a phenomenon called skid, the attribution may be off by several or even dozens of instructions, so these percentages should not be taken as gospel. In this case, for instance, it's unlikely that the two
shlq instructions at offsets 35ec6 and 35ecb really differ by 5x.
The next section of
perf annotate output is of
__vdso_clock_gettime@@LINUX_2.6. VDSO is a way to speed up certain syscalls, notably
gettimeofday. Not much to see here, other than to note that maybe we shouldn't be calling
gettimeofday(2) as much.
The next section is of
_imaging.cpython-35m-x86_64-linux-gnu.so, which has a large block of fairly hot instructions, the hottest being a
movzbl (zero-extend 1 byte into 4 bytes). At this point, we have a hypothesis: we're spending a lot of time decoding and scaling images.
To get a clearer view of what paths through the program are the most relevant, we'll use a flame graph. There are other things we could do with
perf report, but flame graphs are easier to understand in my experience. If you haven't worked with flame graphs before, the general idea is that width = percentage of samples and height = call stack. As an example, a tall, skinny column means a deep call stack that doesn't use much CPU time, while a wide, short column means a shallow call stack that uses a lot of CPU.
Clone the Pyflame repo and follow their compile instructions. There's no need to install it anywhere (the
make install step) -- just having a compiled
pyflame binary in the build output is sufficient.
As with the other tools, we attach the compiled
pyflame binary to the running process and let it run for 10 minutes to get good fidelity.
In the meantime, clone FlameGraph so we can render
pyflame's output as an SVG using its flamegraph.pl script.
The resulting flamegraph looks like this, which you'll probably want to open in a separate tab and zoom in on (the SVG has helpful mouseover info):
Most of the time is idle, which isn't interesting in this case. Back to pyflame, this time with
-x to exclude idle time.
That's much easier to see. If you spend some time mousing around the left third or so of the graph, you'll find that a significant amount of execution time is spent on image decoding and manipulation, starting with
neural_style.py:67.
Put another way, it's spending enough time decoding images that it's a significant part of the overall execution, and it's all clumped together in one place rather than being spread across the whole program. In a way, that's good news, because that's a problem we can pretty easily do something about. There's still the other 2/3rds of the graph that we'd love to optimize away too, but that's mostly running the PyTorch model, and improving that will require much more surgery. There is a surprising amount of time spent in
vgg.py's usage of
namedtuple, though -- that might be a good thing to investigate another time. So, let's work on that first third of the graph: loading the training data.
Adding some parallelism
By now that
python command has probably been running for a good long while. It helpfully outputs timestamps every 2000 iterations, and comparing the 20,000 iteration timestamp with the start timestamp I get 21m 40s elapsed. We'll use this duration later.
It's unlikely that we'll make huge strides in JPEG decoding itself, as that code is already written in a low-level language and reasonably well optimized. What we can do, though, is move the CPU-intensive work of image decoding onto another core. Normally this would be a good place to use threads, but Python as a language and CPython as a runtime are not well suited to multithreading. We can use a separate
Process to sidestep the GIL (Global Interpreter Lock) and CPython's lack of a well-defined memory model, though; even though inter-process communication has more overhead than sharing memory between threads, it should still be a net win. Conveniently, the work we want to execute concurrently is small and fairly isolated, so it should be easy to move to another process. The training data
DataLoader is set up before the training loop in neural_style.py and then enumerated inside the loop.
So, all we need to do is move the loading to another process. We can do this with a Queue (actually, one of PyTorch's wrappers). Instead of enumerating
train_loader in the main process, we'll have another process do that so that all the image decoding can happen on another core, and we'll receive the decoded data in the main process. To make it easy to enumerate a
Queue, we'll start with a helper that makes a
Queue iterable.
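A minimal sketch of such a helper, using the QueueIterator name that appears later in the post and a None end-of-data sentinel (the method bodies are an assumption):

```python
class QueueIterator:
    """Wrap a queue so it can be consumed with a plain for loop.

    Iteration stops when None is popped off the queue; the producer
    pushes None as an end-of-data sentinel.
    """

    def __init__(self, queue):
        self.queue = queue

    def __iter__(self):
        return self

    def __next__(self):
        item = self.queue.get()
        if item is None:
            raise StopIteration
        return item
```

With this in place, the consumer side is just `for batch in QueueIterator(batch_queue): ...`, and the producer's contract is to push batches followed by a final None.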
We're making the simple assumption that if we pop
None off the queue, the iteration is complete. Next, we need a function to invoke in another process that will populate the queue. This is made a bit more complicated by the fact that we can't spawn a process at the intuitive point in the training
for loop. If we create a new
Process where we enumerate over
train_loader, the child process hangs immediately and is never able to populate the queue. The PyTorch docs warn about such issues, but unfortunately using
torch.multiprocessing's wrappers or
SimpleQueue did not help. Getting to the root cause of that problem will be a task for another day, but it's simple enough to rearrange the code to avoid the problem: fork a worker process earlier, and re-use it across multiple iterations. To do this, we use another
Queue as a simple communication mechanism, which is
control_queue below. The usage is pretty basic: sending True on
control_queue tells the worker to enumerate the loader and populate
batch_queue, finishing with a
None to signal completion to the
QueueIterator on the other end, while sending
False tells the worker that its job is done and it can end its loop (and therefore exit).
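Following the control_queue protocol just described, the worker might look roughly like this (the function and parameter names are assumptions; a truthy control message triggers one pass over the loader, False ends the worker):

```python
def loader_worker(loader, batch_queue, control_queue):
    """Runs in the child process.

    Each truthy value received on control_queue triggers one full
    enumeration of the loader, pushing every batch onto batch_queue
    and finishing with a None sentinel. A falsy value (False) tells
    the worker to exit its loop.
    """
    while control_queue.get():
        for batch in loader:
            batch_queue.put(batch)
        batch_queue.put(None)  # end-of-epoch marker for the consumer
```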
Now we have everything we need to wire it all together: before the training loop, create the two queues and fork the worker process, then have the training loop iterate over a QueueIterator wrapped around batch_queue.
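Putting it all together, here's a self-contained sketch of the wiring. It uses Python's standard multiprocessing where the original used PyTorch's wrappers, a plain list stands in for the real DataLoader, and iter(queue.get, None) plays the role of QueueIterator; all names are assumptions:

```python
import multiprocessing as mp

def loader_worker(loader, batch_queue, control_queue):
    # One pass over the loader per truthy control message; exit on falsy.
    while control_queue.get():
        for batch in loader:
            batch_queue.put(batch)
        batch_queue.put(None)  # end-of-epoch sentinel

if __name__ == "__main__":
    batch_queue = mp.Queue()
    control_queue = mp.Queue()
    train_loader = [[0, 1], [2, 3]]  # stand-in for the real DataLoader

    worker = mp.Process(target=loader_worker,
                        args=(train_loader, batch_queue, control_queue))
    worker.start()

    for epoch in range(2):
        control_queue.put(True)  # ask the worker for one epoch of batches
        for batch in iter(batch_queue.get, None):
            pass  # the training step would consume `batch` here

    control_queue.put(False)  # tell the worker it can exit
    worker.join()
```

Because the worker is forked once and reused across epochs, it never has to be spawned mid-loop, which avoids the hang described above.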
After stopping the previous training invocation and starting a new one, we can immediately see a good change in the nvidia-smi dmon utilization numbers.
The GPU is staying fully utilized. Here's the master process:
What was previously the largest chunk of work in the flame graph is now the small peak on the left. Training logic dominates the graph, instead of image decoding, and
namedtuple looks like increasingly low-hanging fruit at 13% of samples...
And the worker that loads training images:
It's spending most of its time doing image decoding. There's some CPU spent in Python's Queue implementation, but the worker sits at about 40% CPU usage total anyway, so the inter-process communication isn't a major bottleneck in this case.
More importantly, for our "time until 20,000 iterations" measurement, that improves from 21m40s to 17m9s, or about a 26% improvement in iterations/sec (15.4 to 19.4). Not bad for just a few lines of straightforward code.