<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Florent - In Fine - Le Blog</title>
	<atom:link href="https://blog.infine.com/author/fduguetinfine-com/feed" rel="self" type="application/rss+xml" />
	<link>https://blog.infine.com</link>
	<description>The blog about tomorrow's technologies!</description>
	<lastBuildDate>Fri, 19 Nov 2021 09:21:33 +0000</lastBuildDate>
	<language>fr-FR</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.5.7</generator>

<image>
	<url>https://blog.infine.com/wp-content/uploads/2021/03/cropped-vignette-32x32.png</url>
	<title>Florent - In Fine - Le Blog</title>
	<link>https://blog.infine.com</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Generics and virtual functions</title>
		<link>https://blog.infine.com/generics-and-virtual-functions-3355?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=generics-and-virtual-functions</link>
					<comments>https://blog.infine.com/generics-and-virtual-functions-3355#respond</comments>
		
		<dc:creator><![CDATA[Florent]]></dc:creator>
		<pubDate>Fri, 19 Nov 2021 09:21:32 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<guid isPermaLink="false">https://blog.infine.com/?p=3355</guid>

					<description><![CDATA[<p><span class="rt-reading-time" style="display: block;"><span class="rt-label rt-prefix">Reading time: </span> <span class="rt-time">4</span> <span class="rt-label rt-postfix">min.</span></span> Hybridizer supports&#160;generics&#160;and&#160;virtual functions. These concepts allow writing flexible code with type parameters, deferring actual behavior resolution to type instantiation by client code. These are fundamental features of modern languages, easing encapsulation, code factorization and the expression of concepts. However, in C#, type parameters are resolved at runtime, which comes with a significant performance penalty. Hybridizer maps them &#8230;</p>
<p>The post <a href="https://blog.infine.com/generics-and-virtual-functions-3355">Generics and virtual functions</a> first appeared on <a href="https://blog.infine.com">In Fine - Le Blog</a>.</p>]]></description>
										<content:encoded><![CDATA[<span class="rt-reading-time" style="display: block;"><span class="rt-label rt-prefix">Reading time: </span> <span class="rt-time">4</span> <span class="rt-label rt-postfix">min.</span></span>
<p>Hybridizer supports&nbsp;<a href="https://docs.microsoft.com/en-us/dotnet/csharp/programming-guide/generics/" target="_blank" rel="noopener">generics</a>&nbsp;and&nbsp;<a href="https://msdn.microsoft.com/en-us/library/aa645767(v=vs.71).aspx" target="_blank" rel="noopener">virtual functions</a>. These concepts allow writing flexible code with type parameters, deferring actual behavior resolution to type instantiation by client code. These are fundamental features of modern languages, easing encapsulation, code factorization and the expression of concepts.</p>



<p>However, in C#, type parameters are resolved at runtime, which comes with a significant performance penalty. Hybridizer maps them to&nbsp;<a href="http://www.cplusplus.com/doc/oldtutorial/templates/" target="_blank" rel="noopener">C++ templates</a>, which are resolved at compile time. As such, templates allow inlining and interprocedural optimization as in plain C code. The performance penalty therefore disappears.</p>



<p>As an example, we will demonstrate the use of generics on a fun mathematical problem: solving the heat equation with random walks.</p>



<figure class="wp-block-image"><a href="http://hybridizer.io/wp-content/uploads/2017/07/montecarlo_heat_equation_256x256x1000.png" class="fancyboxgroup" rel="gallery-3355"><img decoding="async" src="http://hybridizer.io/wp-content/uploads/2017/07/montecarlo_heat_equation_256x256x1000.png" alt="" class="wp-image-386"/></a></figure>



<div class="center"></div>



<h3 class="wp-block-heading">Mathematical background</h3>



<p>Given a connected bounded 2D domain&nbsp;<span id="MathJax-Element-1-Frame" class="MathJax" tabindex="0"><span id="MathJax-Span-1" class="math"><span id="MathJax-Span-2" class="mrow"><span id="MathJax-Span-3" class="mi">Ω</span></span></span></span>&nbsp;in&nbsp;<span id="MathJax-Element-2-Frame" class="MathJax" tabindex="0"><span id="MathJax-Span-4" class="math"><span id="MathJax-Span-5" class="mrow"><span id="MathJax-Span-6" class="msubsup"><span id="MathJax-Span-7" class="texatom"><span id="MathJax-Span-8" class="mrow"><span id="MathJax-Span-9" class="mi">ℝ</span></span></span><span id="MathJax-Span-10" class="mn">2</span></span></span></span></span>, its boundary&nbsp;<span id="MathJax-Element-3-Frame" class="MathJax" tabindex="0"><span id="MathJax-Span-11" class="math"><span id="MathJax-Span-12" class="mrow"><span id="MathJax-Span-13" class="mi">∂</span><span id="MathJax-Span-14" class="mi">Ω</span></span></span></span>&nbsp;and a function&nbsp;<span id="MathJax-Element-4-Frame" class="MathJax" tabindex="0"><span id="MathJax-Span-15" class="math"><span id="MathJax-Span-16" class="mrow"><span id="MathJax-Span-17" class="mi">𝑓</span><span id="MathJax-Span-18" class="mo">∈</span><span id="MathJax-Span-19" class="msubsup"><span id="MathJax-Span-20" class="texatom"><span id="MathJax-Span-21" class="mrow"><span id="MathJax-Span-22" class="mtext">L</span></span></span><span id="MathJax-Span-23" class="mn">2</span></span><span id="MathJax-Span-24" class="mo">(</span><span id="MathJax-Span-25" class="mi">∂</span><span id="MathJax-Span-26" class="mi">Ω</span><span id="MathJax-Span-27" class="mo">)</span></span></span></span>, we seek&nbsp;<span id="MathJax-Element-5-Frame" class="MathJax" tabindex="0"><span id="MathJax-Span-28" class="math"><span id="MathJax-Span-29" class="mrow"><span id="MathJax-Span-30" class="mi">𝑢</span></span></span></span>&nbsp;such that:</p>



<p class="center"><span id="MathJax-Element-6-Frame" class="MathJax" tabindex="0"><span id="MathJax-Span-31" class="math"><span id="MathJax-Span-32" class="mrow"><span id="MathJax-Span-33" class="mtable"><span id="MathJax-Span-34" class="mtd"><span id="MathJax-Span-35" class="mrow"><span id="MathJax-Span-36" class="mi">Δ</span><span id="MathJax-Span-37" class="mi">𝑢</span><span id="MathJax-Span-38" class="mo">=</span><span id="MathJax-Span-39" class="mn">0</span></span></span></span></span></span></span> on Ω</p>



<p class="center"><span id="MathJax-Element-6-Frame" class="MathJax" tabindex="0"><span id="MathJax-Span-31" class="math"><span id="MathJax-Span-32" class="mrow"><span id="MathJax-Span-33" class="mtable"><span id="MathJax-Span-48" class="mtd"><span id="MathJax-Span-49" class="mrow"><span id="MathJax-Span-50" class="mi">𝑢</span><span id="MathJax-Span-51" class="mo">=</span><span id="MathJax-Span-52" class="mi">𝑓</span></span></span></span></span></span></span> on ∂Ω</p>



<p class="center">Classic ways to solve this problem numerically involve&nbsp;<a rel="noopener" href="https://en.wikipedia.org/wiki/Finite_element_method" target="_blank">finite elements</a>&nbsp;or similar discretization methods, which come with different regularity constraints on&nbsp;<span id="MathJax-Element-7-Frame" class="MathJax" tabindex="0"><span id="MathJax-Span-62" class="math"><span id="MathJax-Span-63" class="mrow"><span id="MathJax-Span-64" class="mi">∂</span><span id="MathJax-Span-65" class="mi">Ω</span></span></span></span>.</p>



<p>It turns out we can also solve it using&nbsp;<a href="https://en.wikipedia.org/wiki/Monte_Carlo_method" target="_blank" rel="noopener">Monte Carlo methods</a>&nbsp;by launching Brownian motions. For each point&nbsp;<span id="MathJax-Element-8-Frame" class="MathJax" tabindex="0"><span id="MathJax-Span-66" class="math"><span id="MathJax-Span-67" class="mrow"><span id="MathJax-Span-68" class="mo">(</span><span id="MathJax-Span-69" class="mi">𝑥</span><span id="MathJax-Span-70" class="mo">,</span><span id="MathJax-Span-71" class="mi">𝑦</span><span id="MathJax-Span-72" class="mo">)</span></span></span></span>&nbsp;in&nbsp;<span id="MathJax-Element-9-Frame" class="MathJax" tabindex="0"><span id="MathJax-Span-73" class="math"><span id="MathJax-Span-74" class="mrow"><span id="MathJax-Span-75" class="mi">Ω</span></span></span></span>&nbsp;we launch&nbsp;<span id="MathJax-Element-10-Frame" class="MathJax" tabindex="0"><span id="MathJax-Span-76" class="math"><span id="MathJax-Span-77" class="mrow"><span id="MathJax-Span-78" class="mi">𝑁</span></span></span></span>&nbsp;random walks. For each random walk, we wait until it reaches&nbsp;<span id="MathJax-Element-11-Frame" class="MathJax" tabindex="0"><span id="MathJax-Span-79" class="math"><span id="MathJax-Span-80" class="mrow"><span id="MathJax-Span-81" class="mi">∂</span><span id="MathJax-Span-82" class="mi">Ω</span></span></span></span>&nbsp;and sample the temperature at the exit point. We then sum all those exit temperatures and divide by&nbsp;<span id="MathJax-Element-12-Frame" class="MathJax" tabindex="0"><span id="MathJax-Span-83" class="math"><span id="MathJax-Span-84" class="mrow"><span id="MathJax-Span-85" class="mi">𝑁</span></span></span></span>&nbsp;to get the numerical solution:</p>



<figure class="wp-block-image"><a href="http://hybridizer.io/wp-content/uploads/2017/07/random_walk_square.png" class="fancyboxgroup" rel="gallery-3355"><img decoding="async" src="http://hybridizer.io/wp-content/uploads/2017/07/random_walk_square.png" alt="" class="wp-image-394"/></a></figure>



<div class="center"></div>
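

<p>As a rough illustration of the estimator described above, here is a minimal CPU-only C# sketch on the unit square. It is not the Hybridizer code of this post: the class name <code>RandomWalkSketch</code>, the fixed-length step in a uniformly random direction, the step size and the boundary function are all assumptions chosen for illustration.</p>



<pre class="wp-block-code"><code lang="csharp" class="language-csharp">using System;

public static class RandomWalkSketch
{
    // Hypothetical boundary temperature f: hot on the left edge, cold elsewhere.
    public static double Boundary(double x, double y) =&gt; x &lt;= 0.0 ? 1.0 : 0.0;

    // Estimates u(x, y) by averaging f over the exit points of n random walks.
    public static double Estimate(double x, double y, int n, double step, Random rng)
    {
        double sum = 0.0;
        for (int k = 0; k &lt; n; ++k)
        {
            double px = x, py = y;
            // walk until the point leaves the open unit square
            while (px &gt; 0.0 &amp;&amp; px &lt; 1.0 &amp;&amp; py &gt; 0.0 &amp;&amp; py &lt; 1.0)
            {
                double angle = 2.0 * Math.PI * rng.NextDouble();
                px += step * Math.Cos(angle);
                py += step * Math.Sin(angle);
            }
            // sample the temperature at the exit point
            sum += Boundary(px, py);
        }
        return sum / n;
    }

    public static void Main()
    {
        var rng = new Random(42);
        // at the center, symmetry suggests a value close to 0.25
        Console.WriteLine(Estimate(0.5, 0.5, 10000, 0.05, rng));
    }
}</code></pre>



<p>Each call to <code>Estimate</code> is independent of the others, which is what makes the method easy to distribute over many threads.</p>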



<p>This method is quite slow (compared to finite elements or similar). But it has some advantages:</p>



<ul><li>It can easily be distributed among many threads (all points are independent)</li><li>It’s possible to compute the solution at a specific location</li><li><span id="MathJax-Element-13-Frame" class="MathJax" tabindex="0"><span id="MathJax-Span-86" class="math"><span id="MathJax-Span-87" class="mrow"><span id="MathJax-Span-88" class="mi">Ω</span></span></span></span>&nbsp;has almost no regularity constraint (exterior cone condition)</li><li>Absolutely no memory footprint (except for the solution)</li><li>It works similarly in higher dimensions</li></ul>



<p>Full explanations can be found in&nbsp;<a href="http://www.altimesh.com/wp-content/uploads/2017/07/brownien.pdf" target="_blank" rel="noopener">this old research report</a>&nbsp;(in French).</p>



<h3 class="wp-block-heading">Code</h3>



<p>We structured our code the way a generic mathematical solver would be written. The main class is&nbsp;<code>MonteCarloHeatSolver</code>, which takes an&nbsp;<code>I2DProblem</code>&nbsp;and solves it.<br>This code is generic and solves a problem described by an interface:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
public class MonteCarloHeatSolver
{
    I2DProblem _problem;

    public MonteCarloHeatSolver(I2DProblem problem)
    {
        _problem = problem;
    }

    [EntryPoint]
    public void Solve()
    {
        int stopX = _problem.MaxX();
        int stopY = _problem.MaxY();
        // grid-stride loops over the points (indices start at 1)
        for (int j = 1 + threadIdx.y + blockIdx.y * blockDim.y; j &lt; stopY; j += blockDim.y * gridDim.y) {
            for (int i = 1 + threadIdx.x + blockIdx.x * blockDim.x; i &lt; stopX; i += blockDim.x * gridDim.x) {
                float2 position;
                position.x = i;
                position.y = j;
                _problem.Solve(position); // virtual call, dispatched at runtime
            }
        }
    }
}
</pre></div>


<p><br>where the actual resolution (geometry-related) is deferred to an&nbsp;<code>I2DProblem</code>. The (virtual) call to&nbsp;<code>Solve</code>&nbsp;cannot be mapped to a template, so Hybridizer dispatches it correctly at runtime. This costs a vtable lookup, but only once per thread. The performance-critical code lies in the random walk and the boundary conditions, which are generic parameters of the 2D problem.</p>
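


<p>The split between one virtual call on the outside and generic calls on the inside can be sketched in plain C#, without any Hybridizer attribute. All names below (<code>IOp</code>, <code>Twice</code>, <code>IProblem</code>, <code>Problem&lt;TOp&gt;</code>, <code>DispatchSketch</code>) are hypothetical stand-ins for the types of this post, not part of its code.</p>



<pre class="wp-block-code"><code lang="csharp" class="language-csharp">using System;

// stand-in for a template concept such as IBoundaryCondition
public interface IOp { float Apply(float x); }

public struct Twice : IOp
{
    public float Apply(float x) =&gt; 2.0f * x;
}

// stand-in for I2DProblem
public interface IProblem { float Solve(float x); }

public class Problem&lt;TOp&gt; : IProblem where TOp : struct, IOp
{
    public float Solve(float x)
    {
        TOp op = default(TOp);
        // call on a struct type parameter: the target is known once TOp is fixed
        return op.Apply(x);
    }
}

public static class DispatchSketch
{
    public static float Run(IProblem problem, float x)
    {
        // call through the interface: resolved at runtime
        return problem.Solve(x);
    }

    public static void Main()
    {
        Console.WriteLine(Run(new Problem&lt;Twice&gt;(), 21.0f)); // prints 42
    }
}</code></pre>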



<p>For example, the C# code can be instantiated as follows:</p>



<pre class="wp-block-code"><code class=""> var problem = new SquareProblem&lt;SimpleWalker, SimpleBoundaryCondition&gt;(N, iterCount);
 var solver = new MonteCarloHeatSolver(problem);
 solver.Solve(); </code></pre>





<p>We then have two interfaces describing random walks and boundary conditions:</p>



<pre class="wp-block-code"><code class=""> [HybridTemplateConcept]
 public interface IRandomWalker
 {
     [Kernel]
     void Init();
     [Kernel]
     void Walk(float2 f, out float2 t);
 }

 [HybridTemplateConcept]
 public interface IBoundaryCondition
 {
     [Kernel]
     float Temperature(float x, float y);
 } </code></pre>



<p><br>These interfaces are decorated with&nbsp;<code>[HybridTemplateConcept]</code>, which tells Hybridizer that these types will be used as type parameters. They can be implemented by concrete types such as:</p>



<pre class="wp-block-code"><code class=""> public struct SimpleBoundaryCondition : IBoundaryCondition
 {
     [Kernel]
     public float Temperature(float x, float y)
     {
         if ((x &gt;= 1.0F &amp;&amp; y &gt;= 0.5F) || (x &lt;= 0.0F &amp;&amp; y &lt;= 0.5F))
             return 1.0F;
         return 0.0F;
     }
 } </code></pre>



<p>Generic types using these interfaces have to tell Hybridizer how to generate template code from generics. This is again done with attributes.</p>



<p>For example:</p>



<pre class="wp-block-code"><code class=""> [HybridRegisterTemplate(Specialize = typeof(SquareProblem&lt;SimpleWalker, SimpleBoundaryCondition&gt;))]
 public class SquareProblem&lt;TRandomWalker, TBoundaryCondition&gt;: I2DProblem
     where TRandomWalker : struct, IRandomWalker
     where TBoundaryCondition: struct, IBoundaryCondition
 {
     // other interface methods implementations
     // ...

     [Kernel]
     public void Solve(float2 position)
     {
         TRandomWalker walker = default(TRandomWalker);
         TBoundaryCondition boundaryCondition = default(TBoundaryCondition);
         walker.Init(); // generic parameter method call -- will be inlined
         float temperature = 0.0F;
         float size = (float)_N;
         for (int iter = 0; iter &lt; _iter; ++iter)
         {
             float2 f = position;

             while (true)
             {
                 float2 t;
                 walker.Walk(f, out t); // generic parameter method call -- will be inlined

                 // when on border, break
                 if (t.x &lt;= 0.0F || t.y &gt;= size || t.x &gt;= size || t.y &lt;= 0.0F)
                 {
                     // generic parameter method call -- will be inlined
                     temperature += boundaryCondition.Temperature((float)t.x * _h, (float)t.y * _h);
                     break;
                 }

                 // otherwise continue walk
                 f = t;
             }
         }

         _inner[((int)(position.y - 1)) * (_N - 1) + (int)(position.x - 1)] = temperature * _invIter;
     }
 } </code></pre>





<pre class="wp-block-code"><code class=""> [HybridRegisterTemplate(Specialize = typeof(TetrisProblem&lt;SimpleWalker, TetrisBoundaryCondition&gt;))]
 public class TetrisProblem&lt;TRandomWalker, TBoundaryCondition&gt; : I2DProblem
  where TRandomWalker : struct, IRandomWalker
  where TBoundaryCondition : struct, IBoundaryCondition
 {
  // actual interface implementation
 } </code></pre>






<h3 class="wp-block-heading">Results</h3>






<h4 class="wp-block-heading">Virtual (dispatched) calls</h4>



<p>Virtual functions trigger a&nbsp;<code>callvirt</code>&nbsp;in the MSIL.</p>



<pre class="wp-block-code"><code class="">IL_005e: callvirt instance void MonteCarloHeatEquation.I2DProblem::Solve</code></pre>



<p><br>Profiling the generated code with nvvp shows us that a vtable is generated by the Hybridizer (ensuring the right method is called):</p>



<div class="wp-block-image"><figure class="aligncenter"><a href="http://www.altimesh.com/wp-content/uploads/2017/07/dispatchant_call.png" class="fancyboxgroup" rel="gallery-3355"><img decoding="async" src="http://hybridizer.io/wp-content/uploads/2017/07/dispatchant_call-1024x325.png" alt="" class="wp-image-405"/></a></figure></div>



<div class="center"></div>



<h4 class="wp-block-heading">Generics and templates</h4>



<p>On the other hand,&nbsp;<code>IRandomWalker</code>&nbsp;and&nbsp;<code>IBoundaryCondition</code>&nbsp;type parameters are mapped to templates. Their methods are therefore inlined, as shown in this nvvp profiling:</p>



<div class="wp-block-image"><figure class="aligncenter"><a href="http://hybridizer.io/wp-content/uploads/2017/07/template_call.png" class="fancyboxgroup" rel="gallery-3355"><img decoding="async" src="http://hybridizer.io/wp-content/uploads/2017/07/template_call-1024x374.png" alt="" class="wp-image-406"/></a></figure></div>



<div class="center"></div>



<p><em>By the way: the images above show that your C# code is linked to the SASS in the profiler. See our post about&nbsp;<a href="http://www.altimesh.com/debugging-and-profiling/" target="_blank" rel="noopener">debugging and profiling</a>.</em></p>



<h3 class="wp-block-heading">Conclusion</h3>



<p>With few restrictions, you can safely use generic type parameters and virtual calls in your C# code. Hybridizer will map each of them to the appropriate mechanism (vtable or template) as required.<br>Virtual calls give you the full flexibility of an inheritance hierarchy, but come at a performance cost. On the other hand, generics deliver full performance (inlined calls), as long as the right metadata has been provided.</p>]]></content:encoded>
					
					<wfw:commentRss>https://blog.infine.com/generics-and-virtual-functions-3355/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>From C# to SIMD : Numerics.Vector and Hybridizer</title>
		<link>https://blog.infine.com/from-c-to-simd-numerics-vector-and-hybridizer-3339?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=from-c-to-simd-numerics-vector-and-hybridizer</link>
					<comments>https://blog.infine.com/from-c-to-simd-numerics-vector-and-hybridizer-3339#respond</comments>
		
		<dc:creator><![CDATA[Florent]]></dc:creator>
		<pubDate>Wed, 06 Oct 2021 08:49:00 +0000</pubDate>
				<category><![CDATA[C#]]></category>
		<category><![CDATA[Hybridizer]]></category>
		<guid isPermaLink="false">https://blog.infine.com/?p=3339</guid>

					<description><![CDATA[<p><span class="rt-reading-time" style="display: block;"><span class="rt-label rt-prefix">Reading time: </span> <span class="rt-time">5</span> <span class="rt-label rt-postfix">min.</span></span> System.Numerics.Vector&#160;is a library provided by .NET (as a NuGet package), which tries to leverage SIMD instructions on the target hardware. It exposes a few value types, such as&#160;Vector&#60;T&#62;, which are recognized by&#160;RyuJIT&#160;as intrinsics. Supported intrinsics are listed in the&#160;CoreCLR GitHub repository. This allows C# SIMD acceleration, as long as code is modified to use these intrinsic types, instead &#8230;</p>
<p>The post <a href="https://blog.infine.com/from-c-to-simd-numerics-vector-and-hybridizer-3339">From C# to SIMD : Numerics.Vector and Hybridizer</a> first appeared on <a href="https://blog.infine.com">In Fine - Le Blog</a>.</p>]]></description>
										<content:encoded><![CDATA[<span class="rt-reading-time" style="display: block;"><span class="rt-label rt-prefix">Reading time: </span> <span class="rt-time">5</span> <span class="rt-label rt-postfix">min.</span></span>
<p><a href="https://msdn.microsoft.com/en-us/library/dn858218(v=vs.111).aspx" target="_blank" rel="noopener">System.Numerics.Vector</a>&nbsp;is a library provided by .NET (as a NuGet package), which tries to leverage SIMD instructions on the target hardware. It exposes a few value types, such as&nbsp;<code>Vector&lt;T&gt;</code>, which are recognized by&nbsp;<a href="https://blogs.msdn.microsoft.com/dotnet/2013/09/30/ryujit-the-next-generation-jit-compiler-for-net/" target="_blank" rel="noopener">RyuJIT</a>&nbsp;as intrinsics.<br>Supported intrinsics are listed in the&nbsp;<a href="https://raw.githubusercontent.com/dotnet/coreclr/master/src/jit/simdintrinsiclist.h" target="_blank" rel="noopener">CoreCLR GitHub repository</a>.<br>This allows C# SIMD acceleration, as long as the code is modified to use these intrinsic types instead of scalar floating-point elements.</p>



<p>On the other hand, Hybridizer aims to provide those benefits without being intrusive in the code (only metadata is required).</p>



<p>We naturally wanted to test whether System.Numerics.Vector delivers good performance compared to Hybridizer.</p>



<figure class="wp-block-table"><table><tbody><tr><td><strong>Summary</strong><br>We measured that Numerics.Vector provides a good speed-up over scalar C# code as long as no transcendental function (such as Math.Exp) is involved, but still lags behind Hybridizer. Because some operators and mathematical functions are missing, Numerics.Vector can also generate really slow code (when the AVX pipeline is broken). In addition, modifying the code is a heavy process that can’t easily be rolled back.</td></tr></tbody></table></figure>



<p>We wrote and ran two benchmarks, and for each of them we have four versions:</p>



<ul><li>Simple C# scalar code</li><li>Numerics.Vector</li><li>Simple C# scalar code, hybridized</li><li>Numerics.Vector, hybridized</li></ul>



<p>The processor is a&nbsp;<a href="http://ark.intel.com/products/75124/Intel-Core-i7-4770S-Processor-8M-Cache-up-to-3_90-GHz" target="_blank" rel="noopener">Core i7-4770S @ 3.1GHz</a>&nbsp;(max measured turbo in AVX mode being 3.5GHz). Peak performance is 224 GFlop/s, or 112 GCFlop/s if we count an&nbsp;<a href="https://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_operation" target="_blank" rel="noopener">FMA</a>&nbsp;as a single operation (since our processor supports it).</p>
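


<p>The post does not detail how the peak figure is obtained. One decomposition that is consistent with the numbers above, given here as an assumption, is 4 cores at 3.5 GHz, 2 FMA units per core, 4 double-precision lanes per 256-bit AVX register and 2 floating-point operations per FMA. The helper <code>PeakGFlops</code> is hypothetical:</p>



<pre class="wp-block-code"><code lang="csharp" class="language-csharp">using System;

public static class PeakFlops
{
    // peak = cores * frequency (GHz) * FMA units * SIMD lanes * flops per FMA
    public static double PeakGFlops(int cores, double ghz, int fmaUnits, int lanes, int flopsPerFma)
        =&gt; cores * ghz * fmaUnits * lanes * flopsPerFma;

    public static void Main()
    {
        double peak = PeakGFlops(4, 3.5, 2, 4, 2);
        Console.WriteLine(peak);       // 224 (GFlop/s)
        Console.WriteLine(peak / 2.0); // 112 (an FMA counted as one operation)
    }
}</code></pre>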



<h2 class="wp-block-heading">Compute bound benchmark</h2>



<p>This is a compute-intensive benchmark. For each element of a large double-precision array (8 million elements: 67 MB), we iterate twelve times the computation of an exponential’s Taylor expansion (expm1). This is more than enough to enter the compute-bound world, by hiding the latency of memory operations behind a large number of floating-point operations.<br>Scalar code is simply:</p>





<pre class="wp-block-code"><code lang="csharp" class="language-csharp"> [MethodImpl(MethodImplOptions.AggressiveInlining)]
 public static double expm1(double x)
 {
   return ((((((((((((((15.0 + x)
     * x + 210.0)
     * x + 2730.0)
     * x + 32760.0)
     * x + 360360.0)
     * x + 3603600.0)
     * x + 32432400.0)
     * x + 259459200.0)
     * x + 1816214400.0)
     * x + 10897286400.0)
     * x + 54486432000.0)
     * x + 217945728000.0)
     * x + 653837184000.0)
     * x + 1307674368000.0)
     * x * 7.6471637318198164759011319857881e-13;
 }
 [MethodImpl(MethodImplOptions.AggressiveInlining)]
 public static double twelve(double x)
 {
   return expm1(expm1(expm1(expm1(expm1(expm1(expm1(expm1(expm1(expm1(expm1(x)))))))))));
 } </code></pre>



<p>to which we added the&nbsp;<a href="https://msdn.microsoft.com/en-us/library/system.runtime.compilerservices.methodimploptions(v=vs.110).aspx" target="_blank" rel="noopener">AggressiveInlining</a>&nbsp;attribute to help RyuJIT merge operations at JIT time.</p>
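


<p>The constants in the scalar <code>expm1</code> above are the ratios 15!/k!, and the final factor is 1/15!, so the function evaluates the sum of x^k/k! for k from 1 to 15 in Horner form. The sketch below rebuilds the same polynomial in a loop and compares it with <code>Math.Exp(x) - 1</code>. The names <code>Expm1Check</code> and <code>Expm1Taylor</code> are hypothetical and are not part of the benchmark.</p>



<pre class="wp-block-code"><code lang="csharp" class="language-csharp">using System;

public static class Expm1Check
{
    // Horner evaluation of sum_{k=1..15} x^k / k!
    public static double Expm1Taylor(double x)
    {
        double coeff = 1.0; // 15!/15!
        double acc = 1.0;
        for (int k = 15; k &gt;= 2; --k)
        {
            coeff *= k;             // now 15!/(k-1)!
            acc = acc * x + coeff;  // same step as "* x + constant" in the scalar code
        }
        return acc * x / coeff;     // coeff is now 15!
    }

    public static void Main()
    {
        double x = 0.5;
        // difference with the library exponential, expected to be tiny for small x
        Console.WriteLine(Math.Abs(Expm1Taylor(x) - (Math.Exp(x) - 1.0)));
    }
}</code></pre>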



<p>The Numerics.Vector version of the code is quite the same:</p>



<pre class="wp-block-code"><code lang="csharp" class="language-csharp"> [MethodImpl(MethodImplOptions.AggressiveInlining)]
 public static Vector&lt;double> expm1(Vector&lt;double> x)
 {
   return ((((((((((((((new Vector&lt;double>(15.0) + x)
     * x + new Vector&lt;double>(210.0))
     * x + new Vector&lt;double>(2730.0))
     * x + new Vector&lt;double>(32760.0))
     * x + new Vector&lt;double>(360360.0))
     * x + new Vector&lt;double>(3603600.0))
     * x + new Vector&lt;double>(32432400.0))
     * x + new Vector&lt;double>(259459200.0))
     * x + new Vector&lt;double>(1816214400.0))
     * x + new Vector&lt;double>(10897286400.0))
     * x + new Vector&lt;double>(54486432000.0))
     * x + new Vector&lt;double>(217945728000.0))
     * x + new Vector&lt;double>(653837184000.0))
     * x + new Vector&lt;double>(1307674368000.0))
     * x * new Vector&lt;double>(7.6471637318198164759011319857881e-13);
} </code></pre>



<p>The four versions of this code give the following performance results:</p>



<figure class="wp-block-table"><table><tbody><tr><td>Flavor</td><td>Scalar C#</td><td>Vector C#</td><td>Vector Hyb</td><td>Scalar Hyb</td></tr><tr><td>GCFlop/s</td><td>4.31</td><td>19.95</td><td>41.29</td><td>59.65</td></tr></tbody></table></figure>



<div class="wp-block-image"><figure class="aligncenter"><a href="http://www.altimesh.com/wp-content/uploads/2017/06/expm1-numerics-vector-speedup.png" class="fancyboxgroup" rel="gallery-3339"><img decoding="async" src="http://hybridizer.io/wp-content/uploads/2017/06/expm1-numerics-vector-speedup.png" alt="" class="wp-image-237"/></a></figure></div>



<p>As stated, Numerics.Vector delivers a speedup of about 4.6x over scalar code (19.95 vs. 4.31 GCFlop/s). However, performance is far from what we reach with Hybridizer. If we look at the generated assembly, it’s quite clear why:</p>



<pre class="wp-block-code"><code lang="c" class="language-c"> vbroadcastsd ymm0,mmword ptr [7FF7C2255B48h]
 vbroadcastsd ymm1,mmword ptr [7FF7C2255B50h]
 vbroadcastsd ymm2,mmword ptr [7FF7C2255B58h]
 vbroadcastsd ymm3,mmword ptr [7FF7C2255B60h]
 vbroadcastsd ymm4,mmword ptr [7FF7C2255B68h]
 vbroadcastsd ymm5,mmword ptr [7FF7C2255B70h]
 vbroadcastsd ymm7,mmword ptr [7FF7C2255B78h]
 vbroadcastsd ymm8,mmword ptr [7FF7C2255B80h]
 vaddpd ymm0,ymm0,ymm6 
 vmulpd ymm0,ymm0,ymm6
 vaddpd ymm0,ymm0,ymm1
 vmulpd ymm0,ymm0,ymm6
 vaddpd ymm0,ymm0,ymm2
 vmulpd ymm0,ymm0,ymm6
 vaddpd ymm0,ymm0,ymm3
 vmulpd ymm0,ymm0,ymm6
 vaddpd ymm0,ymm0,ymm4
 vmulpd ymm0,ymm0,ymm6
 vaddpd ymm0,ymm0,ymm5
 vmulpd ymm0,ymm0,ymm6
 vaddpd ymm0,ymm0,ymm7
 vmulpd ymm0,ymm0,ymm6
 vaddpd ymm0,ymm0,ymm8
 vmulpd ymm0,ymm0,ymm6
 ; repeated </code></pre>



<p>Fused multiply-adds are not reconstructed, and constant operands are reloaded from the constant pool at each expm1 invocation. This leads to high register pressure (for constants), where memory operands could save some registers.</p>



<p>Here is what the Hybridizer generates from scalar code:</p>



<pre class="wp-block-code"><code lang="csharp" class="language-csharp"> vaddpd ymm1,ymm0,ymmword ptr []
 vfmadd213pd ymm1,ymm0,ymmword ptr []
 vfmadd213pd ymm1,ymm0,ymmword ptr []
 vfmadd213pd ymm1,ymm0,ymmword ptr []
 vfmadd213pd ymm1,ymm0,ymmword ptr []
 vfmadd213pd ymm1,ymm0,ymmword ptr []
 vfmadd213pd ymm1,ymm0,ymmword ptr []
 vfmadd213pd ymm1,ymm0,ymmword ptr []
 vfmadd213pd ymm1,ymm0,ymmword ptr []
 vfmadd213pd ymm1,ymm0,ymmword ptr []
 vfmadd213pd ymm1,ymm0,ymmword ptr []
 vfmadd213pd ymm1,ymm0,ymmword ptr []
 vfmadd213pd ymm1,ymm0,ymmword ptr []
 vfmadd213pd ymm1,ymm0,ymmword ptr []
 vmulpd ymm0,ymm0,ymm1
 vmulpd ymm0,ymm0,ymmword ptr []
 vmovapd ymmword ptr [rsp+0A20h],ymm0
 ; repeated </code></pre>



<p><br>This reconstructs fused multiply-add, and leverages memory operands to save registers.</p>



<p>Why don’t we reach peak performance (112 GCFlop/s)? That is because Haswell has two pipelines for FMA, and a latency of 5 cycles (see the&nbsp;<a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=FMA&amp;expand=2595,2381" target="_blank" rel="noopener">Intel intrinsics guide</a>). To reach peak performance, we would need to issue 2 independent FMA instructions at each cycle. This could be done by reordering instructions in the generated code, since the processor’s&nbsp;<a href="http://www.anandtech.com/show/6355/intels-haswell-architecture/8" target="_blank" rel="noopener">reorder buffer</a>&nbsp;is not long enough to reach independent instructions that far ahead in the instruction stream. LLVM, our backend compiler, is not capable of such reordering. To get better performance, we unfortunately have to write assembly by hand (which is not exactly what a C# programmer expects to do in the morning).</p>



<h2 class="wp-block-heading">Invoke transcendentals</h2>



<p>In this second benchmark, we need to compute the exponential of all the components of a vector. To do that, we invoke&nbsp;<a href="https://msdn.microsoft.com/en-us/library/system.math.exp(v=vs.110).aspx" target="_blank" rel="noopener">Math.Exp</a>.<br>Scalar code is:</p>





<pre class="wp-block-code"><code lang="csharp" class="language-csharp"> [EntryPoint]
 public static void Apply_scal(double[] d, double[] a, double[] b, double[] c, int start, int stop)
 {
   int sstart = start + threadIdx.x + blockDim.x * blockIdx.x;
   int step = blockDim.x * gridDim.x;
   for (int i = sstart; i &lt; stop; i += step)
   {
     d[i] = a[i] * Math.Exp(b[i]) * Math.Exp(c[i]);
   }
 } </code></pre>



<p><br>This function is later called in a&nbsp;<code>Parallel.For</code>&nbsp;construct.</p>



<p>However, Numerics.Vector does not provide a vectorized exponential function. Therefore, we have to write our own:</p>



<pre class="wp-block-code"><code lang="csharp" class="language-csharp"> [IntrinsicFunction("hybridizer::exp")]
 [MethodImpl(MethodImplOptions.AggressiveInlining)]
 public static Vector&lt;double> Exp(Vector&lt;double> x)
 {
   double[] tmp = new double[Vector&lt;double>.Count];
   for(int k = 0; k &lt; Vector&lt;double>.Count; ++k)
   {
     tmp[k] = Math.Exp(x[k]);
   }
   return new Vector&lt;double>(tmp);
 } </code></pre>



<p>At a glance, we can see the problems: each exponential will first break the AVX context (<a href="https://software.intel.com/en-us/articles/avoiding-avx-sse-transition-penalties">which costs tens of cycles</a>), and trigger 4 function calls instead of one.</p>



<p>Unsurprisingly, this code performs really badly:</p>



<figure class="wp-block-table"><table><tbody><tr><td>Flavor</td><td>Scalar C#</td><td>Vector C#</td><td>Vector Hyb</td><td>Scalar Hyb</td></tr><tr><td>GB/s</td><td>13.42</td><td>1.80</td><td>14.91</td><td>14.13</td></tr></tbody></table></figure>



<div class="wp-block-image"><figure class="aligncenter"><a href="http://www.altimesh.com/wp-content/uploads/2017/06/bandwidth-numerics-vector-speedup.png" class="fancyboxgroup" rel="gallery-3339"><img decoding="async" src="http://hybridizer.io/wp-content/uploads/2017/06/bandwidth-numerics-vector-speedup.png" alt="" class="wp-image-242"/></a></figure></div>



<p>The generated assembly confirms what we suspected (context switch and ymm register splitting):</p>



<pre class="wp-block-code"><code class=""> vextractf128 xmm9,ymm6,1
 vextractf128 xmm10,ymm7,1
 vextractf128 xmm11,ymm8,1
 call 00007FF8127C6B80 // exp
 vinsertf128 ymm8,ymm8,xmm11,1
 vinsertf128 ymm7,ymm7,xmm10,1
 vinsertf128 ymm6,ymm6,xmm9,1 </code></pre>



<h2 class="wp-block-heading">Branching</h2>



<p>Branches are expressed using&nbsp;<code>if</code>&nbsp;or ternary operators in scalar code. However, those are not available in Numerics.Vector, since the code is manually vectorized.<br>Branches must be expressed using&nbsp;<code>ConditionalSelect</code>, which leads to code such as:</p>



<pre class="wp-block-code"><code lang="csharp" class="language-csharp"> public static Vector&lt;double> func(Vector&lt;double> x)
 {
   // 1.0 in every lane
   Vector&lt;double> one = Vector&lt;double>.One;
   Vector&lt;long> mask = Vector.GreaterThan(x, one);
   Vector&lt;double> result = Vector.ConditionalSelect(mask, x, one);
   return result;
 } </code></pre>



<p>As we can see, expressing conditions with Numerics.Vector is unintuitive, intrusive, and bug-prone. It’s actually the same as writing AVX compiler intrinsics in C++. On the other hand, Hybridizer supports conditions, which allows you to write the above code this way:</p>



<pre class="wp-block-code"><code lang="csharp" class="language-csharp"> [Kernel]
 public static double func(double x)
 {
   if (x > 1.0)
     return x;
   return 1.0;
 } </code></pre>
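


<p>To check that the two formulations agree, here is a small self-contained sketch that runs the <code>ConditionalSelect</code> version next to the scalar one, without Hybridizer. The class <code>SelectSketch</code> and the names <code>FuncVector</code> and <code>FuncScalar</code> are hypothetical. The number of lanes, <code>Vector&lt;double&gt;.Count</code>, depends on the hardware.</p>



<pre class="wp-block-code"><code lang="csharp" class="language-csharp">using System;
using System.Numerics;

public static class SelectSketch
{
    public static Vector&lt;double&gt; FuncVector(Vector&lt;double&gt; x)
    {
        Vector&lt;double&gt; one = Vector&lt;double&gt;.One;
        // all bits set in the lanes where x &gt; 1, zero elsewhere
        Vector&lt;long&gt; mask = Vector.GreaterThan(x, one);
        // per lane: take x where the mask is set, 1.0 otherwise
        return Vector.ConditionalSelect(mask, x, one);
    }

    public static double FuncScalar(double x) =&gt; x &gt; 1.0 ? x : 1.0;

    public static void Main()
    {
        var input = new double[Vector&lt;double&gt;.Count];
        for (int k = 0; k &lt; input.Length; ++k)
            input[k] = 0.5 * k; // 0.0, 0.5, 1.0, 1.5, ...

        Vector&lt;double&gt; result = FuncVector(new Vector&lt;double&gt;(input));
        for (int k = 0; k &lt; input.Length; ++k)
            Console.WriteLine($"{input[k]} -&gt; {result[k]} (scalar: {FuncScalar(input[k])})");
    }
}</code></pre>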



<h2 class="wp-block-heading">Conclusion</h2>



<p>Numerics.Vector easily delivers reasonable performance on simple code (no branches, no function calls), where the speed-up matches what we expect (the vector unit width). However, expressing conditions is time-consuming and error-prone, and performance collapses as soon as a JIT intrinsic is missing (such as the exponential).</p>






<p class="has-cyan-bluish-gray-color has-text-color"></p><p>The post <a href="https://blog.infine.com/from-c-to-simd-numerics-vector-and-hybridizer-3339">From C# to SIMD : Numerics.Vector and Hybridizer</a> first appeared on <a href="https://blog.infine.com">In Fine - Le Blog</a>.</p>]]></content:encoded>
					
					<wfw:commentRss>https://blog.infine.com/from-c-to-simd-numerics-vector-and-hybridizer-3339/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Mandelbrot with Hybridizer</title>
		<link>https://blog.infine.com/mandelbrot-with-hybridizer-3349?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=mandelbrot-with-hybridizer</link>
					<comments>https://blog.infine.com/mandelbrot-with-hybridizer-3349#respond</comments>
		
		<dc:creator><![CDATA[Florent]]></dc:creator>
		<pubDate>Thu, 10 Jun 2021 15:55:13 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<guid isPermaLink="false">https://blog.infine.com/?p=3349</guid>

					<description><![CDATA[<p><span class="rt-reading-time" style="display: block;"><span class="rt-label rt-prefix">Reading time: </span> <span class="rt-time">4</span> <span class="rt-label rt-postfix">min.</span></span> We describe here the implementation of Mandelbrot fractal image generation using Hybridizer. The language of choice is C#, and the implementation uses 32-bit floating-point arithmetic. The Mandelbrot set is the set of values c for which the sequence z₀ = 0, zₙ₊₁ = zₙ² + c remains bounded in the complex plane. An equivalent definition is: limsup |zₙ| ≤ 2 as n → +∞. That &#8230;</p>
<p>The post <a href="https://blog.infine.com/mandelbrot-with-hybridizer-3349">Mandelbrot with Hybridizer</a> first appeared on <a href="https://blog.infine.com">In Fine - Le Blog</a>.</p>]]></description>
										<content:encoded><![CDATA[<span class="rt-reading-time" style="display: block;"><span class="rt-label rt-prefix">Reading time: </span> <span class="rt-time">4</span> <span class="rt-label rt-postfix">min.</span></span>
<figure class="wp-block-image size-large"><a href="https://blog.infine.com/wp-content/uploads/2021/06/Mandelbrot-1.png" class="fancyboxgroup" rel="gallery-3349"><img fetchpriority="high" decoding="async" width="512" height="512" src="https://blog.infine.com/wp-content/uploads/2021/06/Mandelbrot-1.png" alt="" class="wp-image-3351" srcset="https://blog.infine.com/wp-content/uploads/2021/06/Mandelbrot-1.png 512w, https://blog.infine.com/wp-content/uploads/2021/06/Mandelbrot-1-300x300.png 300w, https://blog.infine.com/wp-content/uploads/2021/06/Mandelbrot-1-150x150.png 150w, https://blog.infine.com/wp-content/uploads/2021/06/Mandelbrot-1-60x60.png 60w" sizes="(max-width: 512px) 100vw, 512px" /></a></figure>



<p>We describe here the implementation of Mandelbrot fractal image generation using Hybridizer. The language of choice is C#, and the implementation uses 32-bit (single-precision) floating-point arithmetic.</p>
<p>The Mandelbrot set is the set of complex values c for which the sequence:</p>
<div align="center">\[ \begin{cases} z_0 = 0 \\ z_{n+1} = z_n^2 + c \end{cases} \]</div>
<p>remains bounded in the complex plane.<br>It turns out that an equivalent definition is:</p>
<div align="center">\[ \limsup_{n \to +\infty} |z_n| \le 2 \]</div>
<p>This means that if \(|z_n|\) exceeds \(2\) at any iteration, the sequence is unbounded and the point c is not in the set.</p>
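<p>As a quick illustration (a worked example added for clarity, with the two values of c chosen arbitrarily), consider the first iterates for c = 1 and c = −1:</p>
<div align="center">\[ c = 1:\quad z_0 = 0,\; z_1 = 1,\; z_2 = 2,\; z_3 = 5 \qquad\qquad c = -1:\quad z_0 = 0,\; z_1 = -1,\; z_2 = 0,\; z_3 = -1,\; \dots \]</div>
<p>For c = 1, \(|z_3| = 5\) is larger than 2, so the sequence escapes and 1 is not in the set. For c = −1, the sequence alternates between 0 and −1 forever, so it stays bounded and −1 belongs to the set.</p>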
<h1>C# implementation</h1>
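<p>In other words, membership can be tested with the following code, which counts the iterations performed before the sequence leaves the disk of radius 2 (or until maxiter is reached):</p>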



<pre class="wp-block-code"><code class=""><strong>public</strong> <strong>static</strong> <strong>int</strong> IterCount(<strong>float</strong> cx, <strong>float</strong> cy)
{
    <strong>int</strong> result = 0;
    <strong>float</strong> x = 0.0f;
    <strong>float</strong> y = 0.0f;
    <strong>float</strong> xx = 0.0f, yy = 0.0f;
    <strong>while</strong> (xx + yy &lt;= 4.0f &amp;&amp; result &lt; maxiter) // still inside the disk of radius 2 (|z|^2 &lt;= 4)?
    {
        xx = x * x;
        yy = y * y;
        <strong>float</strong> xtmp = xx - yy + cx;
        y = 2.0f * x * y + cy; // computes z^2 + c
        x = xtmp;
        result++;
    }

    <strong>return</strong> result;
}</code></pre>



<p>This function has to be evaluated for every point (cx, cy) of interest in the complex plane.<br>To produce an output image, we therefore compute IterCount for every pixel in the square&nbsp;\([-2, 2] \times [-2, 2]\), discretized as an&nbsp;\(N \times N\)&nbsp;grid:</p>



<pre class="wp-block-code"><code class=""><strong>public</strong> <strong>static</strong> <strong>void</strong> Run(<strong>int</strong>[] light)
{
    <strong>for</strong> (<strong>int</strong> i = 0; i &lt; N; i += 1)
    {
        <strong>for</strong> (<strong>int</strong> j = 0; j &lt; N; j += 1)
        {
            <strong>float</strong> x = fromX + i * h;
            <strong>float</strong> y = fromY + j * h;
            light[i* N + j] = IterCount(x, y);
        }
    }
}</code></pre>



<p>where N, h, fromX and fromY are application parameters.<br>Here we compute a&nbsp;\(2048 \times 2048\)&nbsp;image using C# on a Core i7-4770S @ 3.10GHz.<br>This unoptimized version runs in 420 milliseconds, yielding</p>
<p>&nbsp;</p>
<div align="center"><strong>9.986</strong>&nbsp;million pixels / second</div>
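<p>For readers who want to reproduce this measurement, the pieces above can be assembled into a complete program. This is only a sketch: the post does not give the parameter values, so <code>maxiter = 50</code>, <code>fromX = fromY = -2</code> and <code>h = 4 / N</code> are assumptions, and the class name <code>MandelbrotCpu</code> and the helper <code>MegaPixelsPerSecond</code> are hypothetical. Timings will differ with the iteration cap and the machine.</p>

<pre class="wp-block-code"><code lang="csharp" class="language-csharp">using System;
using System.Diagnostics;

public static class MandelbrotCpu
{
    // Assumed values: the post only says these are application parameters.
    public const int N = 2048;         // the image is N x N pixels
    public const int maxiter = 50;     // hypothetical iteration cap
    public const float fromX = -2.0f;  // left edge of the square [-2, 2] x [-2, 2]
    public const float fromY = -2.0f;  // bottom edge of the square
    public const float h = 4.0f / N;   // size of one pixel in the complex plane

    // Number of iterations performed before leaving the disk of radius 2.
    public static int IterCount(float cx, float cy)
    {
        int result = 0;
        float x = 0.0f, y = 0.0f;
        float xx = 0.0f, yy = 0.0f;
        while (xx + yy &lt;= 4.0f &amp;&amp; result &lt; maxiter)
        {
            xx = x * x;
            yy = y * y;
            float xtmp = xx - yy + cx;
            y = 2.0f * x * y + cy; // computes z^2 + c
            x = xtmp;
            result++;
        }
        return result;
    }

    // Sequential version: one IterCount call per pixel.
    public static void Run(int[] light)
    {
        for (int i = 0; i &lt; N; i += 1)
        {
            for (int j = 0; j &lt; N; j += 1)
            {
                float x = fromX + i * h;
                float y = fromY + j * h;
                light[i * N + j] = IterCount(x, y);
            }
        }
    }

    // Throughput in millions of pixels per second.
    public static double MegaPixelsPerSecond(long pixels, double milliseconds)
    {
        return pixels / (milliseconds * 1e-3) / 1e6;
    }

    public static void Main()
    {
        int[] light = new int[N * N];
        Stopwatch watch = Stopwatch.StartNew();
        Run(light);
        watch.Stop();
        double ms = watch.Elapsed.TotalMilliseconds;
        Console.WriteLine("{0:F1} ms, {1:F3} million pixels / second",
            ms, MegaPixelsPerSecond((long)N * N, ms));
    }
}</code></pre>

<p>With the figures quoted above, <code>MegaPixelsPerSecond(2048 * 2048, 420)</code> evaluates to about 9.986, which is where the throughput number comes from.</p>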
<p>It is clear that this code is embarrassingly parallel, since all pixels are independent of each other.</p>
<p>A first, trivial optimization is therefore to make the outer loop parallel:</p>



<pre class="wp-block-code"><code class=""><strong>public</strong> <strong>static</strong> <strong>void</strong> Run(<strong>int</strong>[] light)
{
    Parallel.For(0, N, (i) =&gt; {
        <strong>for</strong> (<strong>int</strong> j = 0; j &lt; N; j += 1)
        {
            <strong>float</strong> x = fromX + i * h;
            <strong>float</strong> y = fromY + j * h;
            light[i * N + j] = IterCount(x, y);
        }
    });
}</code></pre>



<p>This second version runs in 67 milliseconds, giving:</p>
<div align="center"><strong>62.6</strong>&nbsp;million pixels / second</div>
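<p>Since each row writes to its own slice of the output array, the parallel loop should produce exactly the same image as the sequential one. The sketch below checks this and times both versions. The parameter values (<code>N = 1024</code>, <code>maxiter = 50</code>, and the bounds of the square) are assumptions, and the names <code>MandelbrotParallel</code>, <code>RunSequential</code>, <code>RunParallel</code> and <code>SameImage</code> are hypothetical.</p>

<pre class="wp-block-code"><code lang="csharp" class="language-csharp">using System;
using System.Diagnostics;
using System.Threading.Tasks;

public static class MandelbrotParallel
{
    // Assumed parameters: none of these values are given in the post.
    public const int N = 1024;
    public const int maxiter = 50;
    public const float fromX = -2.0f, fromY = -2.0f, h = 4.0f / N;

    public static int IterCount(float cx, float cy)
    {
        int result = 0;
        float x = 0.0f, y = 0.0f, xx = 0.0f, yy = 0.0f;
        while (xx + yy &lt;= 4.0f &amp;&amp; result &lt; maxiter)
        {
            xx = x * x;
            yy = y * y;
            float xtmp = xx - yy + cx;
            y = 2.0f * x * y + cy;
            x = xtmp;
            result++;
        }
        return result;
    }

    // Reference version: plain nested loops.
    public static void RunSequential(int[] light)
    {
        for (int i = 0; i &lt; N; i += 1)
            for (int j = 0; j &lt; N; j += 1)
                light[i * N + j] = IterCount(fromX + i * h, fromY + j * h);
    }

    // Parallel version: rows are distributed over the thread pool.
    // Row i only writes light[i * N .. i * N + N - 1], so rows never overlap.
    public static void RunParallel(int[] light)
    {
        Parallel.For(0, N, i =&gt;
        {
            for (int j = 0; j &lt; N; j += 1)
                light[i * N + j] = IterCount(fromX + i * h, fromY + j * h);
        });
    }

    // True when both arrays hold the same iteration counts.
    public static bool SameImage(int[] a, int[] b)
    {
        if (a.Length != b.Length)
            return false;
        for (int k = 0; k &lt; a.Length; ++k)
            if (a[k] != b[k])
                return false;
        return true;
    }

    public static void Main()
    {
        int[] sequential = new int[N * N];
        int[] parallel = new int[N * N];

        Stopwatch watch = Stopwatch.StartNew();
        RunSequential(sequential);
        double sequentialMs = watch.Elapsed.TotalMilliseconds;

        watch.Restart();
        RunParallel(parallel);
        double parallelMs = watch.Elapsed.TotalMilliseconds;

        Console.WriteLine("sequential: {0:F1} ms, parallel: {1:F1} ms, same image: {2}",
            sequentialMs, parallelMs, SameImage(sequential, parallel));
    }
}</code></pre>

<p>The speed-up obtained this way depends on the number of cores and on how evenly the rows are balanced, since rows crossing the set take longer than rows near the border.</p>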
<h1>Run on the GPU</h1>
<p>To run this on a GPU, we only need to decorate the Run method with EntryPointAttribute:</p>



<pre class="wp-block-code"><code class="">[EntryPoint("run")]
<strong>public</strong> <strong>static</strong> <strong>void</strong> Run(<strong>int</strong>[] light)
{
    Parallel.For(0, N, (i) =&gt; {
        <strong>for</strong> (<strong>int</strong> j = 0; j &lt; N; j += 1)
        {
            <strong>float</strong> x = fromX + i * h;
            <strong>float</strong> y = fromY + j * h;
            light[i * N + j] = IterCount(x, y);
        }
    });
}</code></pre>



<p>and some boilerplate code to invoke the generated method:</p>



<pre class="wp-block-code"><code class="">HybRunner runner = HybRunner.Cuda("Mandelbrot_CUDA.dll").SetDistrib(N, 128);
wrapper = runner.Wrap(<strong>new</strong> Program());
wrapper.Run(light_cuda);</code></pre>



<p>This modified code runs on the GPU (a 1080 Ti) in 10.6 milliseconds (32 milliseconds when counting the memory copies), giving:</p>
<div align="center"><strong>395.7</strong>&nbsp;million pixels / second</div>
<h1>Further optimization</h1>
<p>Launching a block per image line is highly suboptimal, due to the unevenly distributed nature of the computations. For example, threads at the square’s border leave the iteration loop almost immediately, while those on the set run until the iteration limit and therefore take the longest.<br>This can be seen by profiling the above code using Nsight:</p>



<figure class="wp-block-image size-large"><a href="https://blog.infine.com/wp-content/uploads/2021/06/mandelbrot-2.png" class="fancyboxgroup" rel="gallery-3349"><img decoding="async" width="1024" height="397" src="https://blog.infine.com/wp-content/uploads/2021/06/mandelbrot-2-1024x397.png" alt="" class="wp-image-3352" srcset="https://blog.infine.com/wp-content/uploads/2021/06/mandelbrot-2-1024x397.png 1024w, https://blog.infine.com/wp-content/uploads/2021/06/mandelbrot-2-300x116.png 300w, https://blog.infine.com/wp-content/uploads/2021/06/mandelbrot-2-768x298.png 768w, https://blog.infine.com/wp-content/uploads/2021/06/mandelbrot-2-155x60.png 155w, https://blog.infine.com/wp-content/uploads/2021/06/mandelbrot-2.png 1485w" sizes="(max-width: 1024px) 100vw, 1024px" /></a></figure>



<p>As we can see, half of the multiprocessors are idle.</p>
<p>We can instead distribute the work more evenly by using a 2D grid of relatively small blocks.</p>
<p>Fortunately, Hybridizer supports CUDA-like parallelism, so we can modify our entry point this way:</p>



<pre class="wp-block-code"><code class="">[EntryPoint("run")]
<strong>public</strong> <strong>static</strong> <strong>void</strong> Run(<strong>int</strong>[] light)
{
    <strong>for</strong> (<strong>int</strong> i = threadIdx.y + blockDim.y * blockIdx.y; i &lt; N; i += blockDim.y * gridDim.y)
    {
        <strong>for</strong> (<strong>int</strong> j = threadIdx.x + blockDim.x * blockIdx.x; j &lt; N; j += blockDim.x * gridDim.x)
        {
            <strong>float</strong> x = fromX + i * h;
            <strong>float</strong> y = fromY + j * h;
            light[i * N + j] = IterCount(x, y);
        }
    }
}</code></pre>
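<p>A grid-stride loop of this kind must visit every pixel exactly once, whatever the launch configuration. The sketch below emulates the loop on the CPU in plain C# to check that property. It strides <code>i</code> by <code>blockDim.y * gridDim.y</code> and <code>j</code> by <code>blockDim.x * gridDim.x</code>, so that each index advances by the extent of its own dimension, which keeps the coverage exact even when the grid is not square. The <code>Dim3</code> struct and the names <code>GridStrideCheck</code>, <code>ThreadBody</code> and <code>CoversEachPixelOnce</code> are hypothetical stand-ins; on the GPU the indices come from the runtime.</p>

<pre class="wp-block-code"><code lang="csharp" class="language-csharp">using System;

public static class GridStrideCheck
{
    // Minimal stand-in for the CUDA built-in index types (x and y only).
    public struct Dim3
    {
        public int x, y;
        public Dim3(int x, int y) { this.x = x; this.y = y; }
    }

    // Work done by one emulated thread: count a visit for each pixel it owns.
    static void ThreadBody(int[] visits, int N,
        Dim3 threadIdx, Dim3 blockIdx, Dim3 blockDim, Dim3 gridDim)
    {
        for (int i = threadIdx.y + blockDim.y * blockIdx.y; i &lt; N; i += blockDim.y * gridDim.y)
        {
            for (int j = threadIdx.x + blockDim.x * blockIdx.x; j &lt; N; j += blockDim.x * gridDim.x)
            {
                visits[i * N + j]++;
            }
        }
    }

    // Runs every thread of the grid one after another on the CPU and
    // returns true when each of the N x N pixels was visited exactly once.
    public static bool CoversEachPixelOnce(int N, Dim3 gridDim, Dim3 blockDim)
    {
        int[] visits = new int[N * N];
        for (int by = 0; by &lt; gridDim.y; ++by)
            for (int bx = 0; bx &lt; gridDim.x; ++bx)
                for (int ty = 0; ty &lt; blockDim.y; ++ty)
                    for (int tx = 0; tx &lt; blockDim.x; ++tx)
                        ThreadBody(visits, N, new Dim3(tx, ty), new Dim3(bx, by), blockDim, gridDim);

        foreach (int v in visits)
            if (v != 1)
                return false;
        return true;
    }

    public static void Main()
    {
        // A 32 x 32 grid of 16 x 16 blocks on a 2048 x 2048 image.
        Console.WriteLine(CoversEachPixelOnce(2048, new Dim3(32, 32), new Dim3(16, 16)));
        // A deliberately non-square configuration on a size that is not a multiple of it.
        Console.WriteLine(CoversEachPixelOnce(100, new Dim3(3, 5), new Dim3(8, 4)));
    }
}</code></pre>

<p>With a 32 × 32 grid of 16 × 16 blocks, there are 512 × 512 threads for 2048 × 2048 pixels, so each thread handles a 4 × 4 set of pixels spread across the image. Because <code>j</code> follows <code>threadIdx.x</code> and is the fastest-varying index of <code>light</code>, neighbouring threads also write to neighbouring memory locations.</p>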



<p>&nbsp;</p>
<p>and run it with a 2D grid:</p>



<pre title="" class="wp-block-code"><code class="">HybRunner runner = HybRunner.Cuda("Mandelbrot_CUDA.dll").SetDistrib(32, 32, 16, 16, 1, 0);</code></pre>



<p>&nbsp;</p>
<p>The modified code now runs in 920 microseconds on the GPU, meaning:</p>
<div align="center"><strong>4.56 billion</strong>&nbsp;pixels / second</div>
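<p>All throughput figures in this post follow from the same arithmetic: the number of pixels (2048 × 2048) divided by the measured time. The short sketch below recomputes them from the timings quoted in the text; the class name <code>Throughput</code> and the helper <code>PixelsPerSecond</code> are hypothetical.</p>

<pre class="wp-block-code"><code lang="csharp" class="language-csharp">using System;

public static class Throughput
{
    // Pixels per second for a width x height image computed in the given time.
    public static double PixelsPerSecond(int width, int height, double seconds)
    {
        return (double)width * height / seconds;
    }

    public static void Main()
    {
        // Timings quoted in the post for the 2048 x 2048 image.
        string[] labels =
        {
            "C# sequential",
            "C# Parallel.For",
            "GPU, one block per line",
            "GPU, 2D grid"
        };
        double[] seconds = { 420e-3, 67e-3, 10.6e-3, 920e-6 };

        for (int k = 0; k &lt; labels.Length; ++k)
        {
            double rate = PixelsPerSecond(2048, 2048, seconds[k]);
            Console.WriteLine("{0}: {1:F1} million pixels / second", labels[k], rate / 1e6);
        }
    }
}</code></pre>

<p>Taken together, the timings mean that the 2D-grid kernel is roughly 450 times faster than the sequential C# loop (420 milliseconds against 920 microseconds), excluding memory copies.</p>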
<p>If we profile the newly generated kernel, we get:</p>
<p>&nbsp;</p>



<figure class="wp-block-image size-large"><figcaption>[Figure: Nsight profile of the 2D-grid kernel; image unavailable]</figcaption></figure>



<p>The main stall reason is “Other” (69%), followed by instruction fetch, which accounts for only 5% of non-eligible warps.<br>It is therefore reasonable not to attempt further optimization.</p>
<h1>Conclusion</h1>
<p>In this post we presented a basic use of Hybridizer, going from unoptimized C# to highly efficient generated CUDA code. We started by optimizing our C# code and then migrated it seamlessly to CUDA.<br>We then optimized our C# code for CUDA GPUs until we reached close to peak performance.</p><p>The post <a href="https://blog.infine.com/mandelbrot-with-hybridizer-3349">Mandelbrot with Hybridizer</a> first appeared on <a href="https://blog.infine.com">In Fine - Le Blog</a>.</p>]]></content:encoded>
					
					<wfw:commentRss>https://blog.infine.com/mandelbrot-with-hybridizer-3349/feed</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
