Technical Diary, Day 2
Yesterday, and the plan for today
Well, I started trying to write about my progress yesterday, and I think it went well. So far so good. I'll try to keep up writing this technical diary for the next two weeks (before I go on vacation). I'll do a retrospective at the end of each week to see if there's anything related to the process I could improve.
I had some problems with my org-mode project which I use to publish these posts, but I figured it out today. Hopefully, publishing these diary entries will be straightforward now.
Work-wise, I figured out how the allocator works and how Troels' fix to the
allocator have improved the benchmarks of my existential loops. However, the
nbody benchmark is still very slow, even though it doesn't have that many
copies, so today I'll try to figure out why. First I'll try to investigate the
debugging output from running nbody on our test data set, in order to get an idea
of how many allocation attempts there are. Then, I'll add some simple profiling
to the allocator to see how many cache misses there are, guess at why they
happen, and try to figure out how to avoid them.
What's wrong with nbody?
I thought I had fixed nbody with a small change in free_list_find, but that
turned out to be false.
A bigger change seemed to do the trick though. As described yesterday, the allocator looks primarily for existing allocations matching the given tag, or variable name. Troels' fix extends that strategy by also matching on allocations with the exact same size, which allows the allocator to reuse more allocations. My idea was to completely disregard tags, and instead rely only on the size information. My algorithm runs through the free list and finds the free allocation with the best matching size and returns it.
There is a downside to this strategy: A program which continuously allocates larger and larger memory blocks using the same tag will never be able to reuse memory blocks, but it would keep all the previous allocations in the free list nonetheless. The previous allocator would remove and free existing allocations using the same tag before actually allocating a new block.
However, the upside is that many of our benchmarks run substantially faster.
/home/jxk588/src/futhark $ ./tools/cmp-bench-json.py bench-master.json bench-just-size.json
futhark-benchmarks/accelerate/canny/canny.fut
data/lena512.in: 1.16x
data/lena256.in: 1.13x
futhark-benchmarks/accelerate/crystal/crystal.fut
#0 ("200i32 30.0f32 5i32 1i32 1.0f32"): 0.97x
#4 ("2000i32 30.0f32 50i32 1i32 1.0f32"): 1.00x
#5 ("4000i32 30.0f32 50i32 1i32 1.0f32"): 1.00x
futhark-benchmarks/accelerate/fft/fft.fut
data/64x256.in: 1.01x
data/128x512.in: 1.15x
data/1024x1024.in: 1.01x
data/512x512.in: 1.05x
data/256x256.in: 1.03x
data/128x128.in: 1.04x
futhark-benchmarks/accelerate/fluid/fluid.fut
benchmarking/medium.in: 1.07x
futhark-benchmarks/accelerate/hashcat/hashcat.fut
rockyou.dataset: 0.98x
futhark-benchmarks/accelerate/kmeans/kmeans.fut
data/k5_n50000.in: 1.00x
data/trivial.in: 1.08x
data/k5_n200000.in: 1.10x
futhark-benchmarks/accelerate/mandelbrot/mandelbrot.fut
#1 ("1000i32 1000i32 -0.7f32 0.0f32 3.067f32 100i32 16...."): 1.01x
#3 ("4000i32 4000i32 -0.7f32 0.0f32 3.067f32 100i32 16...."): 1.00x
#2 ("2000i32 2000i32 -0.7f32 0.0f32 3.067f32 100i32 16...."): 1.00x
#0 ("800i32 600i32 -0.7f32 0.0f32 3.067f32 100i32 16.0f..."): 0.99x
#4 ("8000i32 8000i32 -0.7f32 0.0f32 3.067f32 100i32 16...."): 1.00x
futhark-benchmarks/accelerate/nbody/nbody-bh.fut
data/10000-bodies.in: 1.13x
data/100000-bodies.in: 1.06x
data/1000-bodies.in: 1.07x
futhark-benchmarks/accelerate/nbody/nbody.fut
data/10000-bodies.in: 1.32x
data/100000-bodies.in: 1.00x
data/1000-bodies.in: 3.15x
futhark-benchmarks/accelerate/pagerank/pagerank.fut
data/small.in: 1.04x
data/random_medium.in: 1.13x
futhark-benchmarks/accelerate/ray/trace.fut
#0 ("800i32 600i32 100i32 50.0f32 -100.0f32 -700.0f32 1..."): 1.00x
futhark-benchmarks/accelerate/smoothlife/smoothlife.fut
#1 ("256i32"): 1.00x
#2 ("512i32"): 0.99x
#3 ("1024i32"): 1.00x
#0 ("128i32"): 1.01x
futhark-benchmarks/accelerate/tunnel/tunnel.fut
#1 ("10.0f32 1000i32 1000i32"): 1.00x
#4 ("10.0f32 8000i32 8000i32"): 1.00x
#0 ("10.0f32 800i32 600i32"): 1.00x
#2 ("10.0f32 2000i32 2000i32"): 1.00x
#3 ("10.0f32 4000i32 4000i32"): 1.00x
futhark-benchmarks/finpar/LocVolCalib.fut
LocVolCalib-data/small.in: 0.99x
LocVolCalib-data/medium.in: 1.00x
LocVolCalib-data/large.in: 1.00x
futhark-benchmarks/finpar/OptionPricing.fut
OptionPricing-data/medium.in: 1.04x
OptionPricing-data/small.in: 1.05x
OptionPricing-data/large.in: 1.00x
futhark-benchmarks/jgf/crypt/crypt.fut
crypt-data/medium.in: 2.51x
futhark-benchmarks/jgf/crypt/keys.fut
crypt-data/userkey0.txt: 1.05x
futhark-benchmarks/jgf/series/series.fut
data/1000000.in: 1.00x
data/10000.in: 0.99x
data/100000.in: 1.00x
futhark-benchmarks/misc/bfast/bfast-cloudy.fut
data/peru.in: 0.94x
data/sahara-cloudy.in: 0.94x
futhark-benchmarks/misc/bfast/bfast.fut
data/sahara.in: 1.03x
futhark-benchmarks/misc/heston/heston32.fut
data/1062_quotes.in: 0.97x
data/10000_quotes.in: 1.03x
data/100000_quotes.in: 0.99x
futhark-benchmarks/misc/heston/heston64.fut
data/1062_quotes.in: 1.00x
data/10000_quotes.in: 1.00x
data/100000_quotes.in: 1.00x
futhark-benchmarks/misc/knn-by-kdtree/buildKDtree.fut
valid-data/kdtree-ppl-32-m-2097152.in: 1.01x
futhark-benchmarks/misc/radix_sort/radix_sort_blelloch_benchmark.fut
data/radix_sort_100K.in: 0.84x
data/radix_sort_10K.in: 1.11x
data/radix_sort_1M.in: 1.09x
futhark-benchmarks/misc/radix_sort/radix_sort_large.fut
data/radix_sort_100K.in: 1.01x
data/radix_sort_10K.in: 0.86x
data/radix_sort_1M.in: 1.02x
futhark-benchmarks/parboil/histo/histo.fut
data/default.in: 1.35x
data/large.in: 1.33x
futhark-benchmarks/parboil/mri-q/mri-q.fut
data/large.in: 1.04x
data/small.in: 1.03x
futhark-benchmarks/parboil/sgemm/sgemm.fut
data/tiny.in: 1.94x
data/small.in: 1.54x
data/medium.in: 1.11x
futhark-benchmarks/parboil/stencil/stencil.fut
data/default.in: 1.00x
data/small.in: 1.00x
futhark-benchmarks/parboil/tpacf/tpacf.fut
data/large.in: 1.00x
data/small.in: 1.00x
data/medium.in: 1.00x
futhark-benchmarks/pbbs/ray/ray.fut
data/angel.in: 0.99x
data/dragon.in: 1.05x
data/happy.in: 1.02x
futhark-benchmarks/rodinia/backprop/backprop.fut
data/small.in: 1.22x
data/medium.in: 1.16x
futhark-benchmarks/rodinia/bfs/bfs_asympt_ok_but_slow.fut
data/64kn_32e-var-1-256-skew.in: 1.05x
data/512nodes_high_edge_variance.in: 1.03x
data/graph1MW_6.in: 1.02x
data/4096nodes.in: 0.99x
futhark-benchmarks/rodinia/bfs/bfs_filt_padded_fused.fut
data/64kn_32e-var-1-256-skew.in: 0.97x
data/512nodes_high_edge_variance.in: 1.08x
data/graph1MW_6.in: 1.12x
data/4096nodes.in: 0.93x
futhark-benchmarks/rodinia/bfs/bfs_heuristic.fut
data/64kn_32e-var-1-256-skew.in: 1.03x
data/512nodes_high_edge_variance.in: 1.05x
data/graph1MW_6.in: 1.07x
data/4096nodes.in: 1.21x
futhark-benchmarks/rodinia/bfs/bfs_iter_work_ok.fut
data/64kn_32e-var-1-256-skew.in: 1.06x
data/512nodes_high_edge_variance.in: 1.24x
data/graph1MW_6.in: 1.05x
data/4096nodes.in: 1.15x
futhark-benchmarks/rodinia/cfd/cfd.fut
data/fvcorr.domn.193K.toa: 1.00x
data/fvcorr.domn.097K.toa: 1.00x
futhark-benchmarks/rodinia/hotspot/hotspot.fut
data/512.in: 1.04x
data/1024.in: 1.02x
data/64.in: 1.04x
futhark-benchmarks/rodinia/kmeans/kmeans.fut
data/kdd_cup.in: 0.98x
data/100.in: 1.02x
data/204800.in: 1.00x
futhark-benchmarks/rodinia/lavaMD/lavaMD.fut
data/3_boxes.in: 1.40x
data/10_boxes.in: 1.13x
futhark-benchmarks/rodinia/lud/lud.fut
data/512.in: 1.01x
data/64.in: 0.99x
data/256.in: 1.11x
data/16by16.in: 1.01x
data/2048.in: 1.01x
futhark-benchmarks/rodinia/myocyte/myocyte.fut
data/small.in: 0.99x
data/medium.in: 1.00x
futhark-benchmarks/rodinia/nn/nn.fut
data/medium.in: 1.03x
futhark-benchmarks/rodinia/nw/nw.fut
data/large.in: 1.00x
futhark-benchmarks/rodinia/particlefilter/particlefilter.fut
data/128_128_10_image_400000_particles.in: 0.99x
data/128_128_10_image_10000_particles.in: 1.01x
futhark-benchmarks/rodinia/pathfinder/pathfinder.fut
data/medium.in: 0.96x
futhark-benchmarks/rodinia/srad/srad.fut
data/image.in: 1.02x
It sounded like the others like the tradeoff, so hopefully the PR will land soon.
Better debugging information in opencl.h
As part of the process of figuring out why nbody was so slow, I've also added
a bit of debugging output to the allocator. More might be warranted in the
future, but for now, we're mostly interested in when allocations actually
happen.
Issues
- Apparently, specifying that a source block contains futhark code causes the block to not scroll correctly when rendered as HTML.
Tomorrow
- Let's try and get started on the linear scan algorithm. First, we need a way to perform liveness analysis on the explicit memory allocation IR in Futhark.
- I should also start looking at the ICFP paper and artifact I'm supposed to review. I should at least print the article tomorrow, so I can read it tomorrow or the day after.
Longer term
- A more complete rewrite of the dynamic allocator might be warranted.