浅尝.NET 4并行计算效率已大幅提升-并行计算场景

随着Visual Studio 2010正式版的发布，我们已经可以用上.NET 4的所有功能。那么对于并行计算的尝试，是本文的重点。51CTO向您推荐《F#函数式编程语言》，以便于您全方位了解.NET 4中不同的部分。

我们都知道CPU的性能至关重要，但主频已经越来越难以提升，纵向发展受限的情况下，横向发展成为必然——核心数开始越来越多。然而多核心的利用、并行计算一直是编程中的难题，大的不说，就说代码的编写，程序员大多都有过痛苦的经历：多线程的程序代码量大，编写复杂，容易出错，并且实际运行效率是否理想也较难保证。

为改善这种状况，.NET 4.0中引入了 TPL(任务并行库)，关于TPL，MSDN的简介是：

任务并行库 (TPL) 的设计是为了能更简单地编写可自动使用多处理器的托管代码。使用该库，您可以非常方便地用现有序列代码表达潜在并行性，这样序列代码中公开的并行任务将会在所有可用的处理器上同时运行。通常这会大大提高速度。

简而言之，TPL提供了一系列的类库，可以使编写并行运算的代码更简单和方便。

说起来很简单，我们来看点例子：

void ThreadpoolMatrixMult(int size, double[,] m1, double[,] m2,   
     double[,] result)  
 {  
   int N = size;                             
int P = 2 * Environment.ProcessorCount; // assume twice the procs for   
                                       // good work distribution  
   int Chunk = N / P;                  // size of a work chunk  
   AutoResetEvent signal = new AutoResetEvent(false);   
   int counter = P;                      // use a counter to reduce   
                                        // kernel transitions      
  for (int c = 0; c < P; c++) {         // for each chunk  
    ThreadPool.QueueUserWorkItem(delegate(Object o)  
    {  
      int lc = (int)o;  
  for (int i = lc * Chunk;           // iterate through a work chunk  
     i < (lc + 1 == P ? N : (lc + 1) * Chunk); // respect upper   
                                               // bound  
           i++) {  
        // original inner loop body  
        for (int j = 0; j < size; j++) {  
          result[i, j] = 0;  
          for (int k = 0; k < size; k++) {  
            result[i, j] += m1[i, k] * m2[k, j];  
          }  
        }  
      }  
   if (Interlocked.Decrement(ref counter) == 0) { // use efficient   
                                                // interlocked   
                                               // instructions        
        signal.Set();  // and kernel transition only when done  
      }  
    }, c);   
  }  
  signal.WaitOne();  
}

很眼熟但同时看着也很心烦的代码吧。在换用TPL后，上面的代码就可以简化为：

void ParMatrixMult(int size, double[,] m1, double[,] m2, double[,] result)  
 {  
   Parallel.For( 0, size, delegate(int i) {  
     for (int j = 0; j < size; j++) {  
      result[i, j] = 0;  
       for (int k = 0; k < size; k++) {  
         result[i, j] += m1[i, k] * m2[k, j];  
       }  
     }  
  });  
}

舒服多了吧？具体的内容请见MSDN的文章优化多核计算机的托管代码。

装好正式版的VS2010以后，写了段代码来测试下，TPL究竟好不好用。

代码很简单，拿一条字符串和一堆字符串里的每一条分别用LevenshteinDistance算法做字符串相似程度比对。先用传统的顺序执行的代码跑一遍，记录下时间；再换用TPL的并行代码跑一遍，记录下时间。然后比对两次运行的时间差异。

using System;  
  using System.Collections.Generic;  
  using System.Linq;  
  using System.Text;  
  using System.Threading.Tasks;  
  using System.Diagnostics;  
    
  namespace ParallelLevenshteinDistance  
  {  
     class Program  
     {  
         static void Main(string[] args)  
         {  
             Stopwatch sw;  
   
             int length;  
             int count;  
             string[] strlist;  
             int[] steps;  
             string comparestring;  
   
             Console.WriteLine("Input string lenth:");  
             length = int.Parse(Console.ReadLine());  
   
             Console.WriteLine("Input string list count:");  
             count = int.Parse(Console.ReadLine());  
   
             comparestring = GenerateRandomString(length);  
             strlist = new string[count];  
             steps = new int[count];  
   
             // prepare string[] for comparison  
             Parallel.For(0, count, delegate(int i)  
             {  
                 strlist[i] = GenerateRandomString(length);  
             });  
   
             Console.WriteLine("{0}Computing...{0}", Environment.NewLine);  
   
             // sequential comparison  
             sw = Stopwatch.StartNew();  
             for (int i = 0; i < count; i++)  
             {  
                 steps[i] = LevenshteinDistance(comparestring, strlist[i]);  
             }  
             sw.Stop();  
             Console.WriteLine("[Sequential] Elapsed:");  
             Console.WriteLine(sw.Elapsed.ToString());  
   
             // parallel comparison  
             sw = Stopwatch.StartNew();  
             Parallel.For(0, count, delegate(int i)  
             {  
                 steps[i] = LevenshteinDistance(comparestring, strlist[i]);  
             });  
             sw.Stop();  
             Console.WriteLine("[Parallel] Elapsed:");  
             Console.WriteLine(sw.Elapsed.ToString());  
                           
             Console.ReadLine();  
         }  
   
         private static string GenerateRandomString(int length)  
         {  
             Random r = new Random((int)DateTime.Now.Ticks);  
             StringBuilder sb = new StringBuilder(length);  
             for (int i = 0; i < length; i++)  
             {  
                int c = r.Next(97, 123);  
                 sb.Append(Char.ConvertFromUtf32(c));  
             }  
             return sb.ToString();  
         }  
   
         private static int LevenshteinDistance(string str1, string str2)  
         {  
             int[,] scratchDistanceMatrix = new int[str1.Length + 1, str2.Length + 1];  
             // distance matrix contains one extra row and column for the seed values              
             for (int i = 0; i <= str1.Length; i++) scratchDistanceMatrix[i, 0] = i;  
             for (int j = 0; j <= str2.Length; j++) scratchDistanceMatrix[0, j] = j;  
   
             for (int i = 1; i <= str1.Length; i++)  
             {  
                 int str1Index = i - 1;  
                 for (int j = 1; j <= str2.Length; j++)  
                 {  
                     int str2Index = j - 1;  
                     var cost = (str1[str1Index] == str2[str2Index]) ? 0 : 1;  
   
                     int deletion = (i == 0) ? 1 : scratchDistanceMatrix[i - 1, j] + 1;  
                     int insertion = (j == 0) ? 1 : scratchDistanceMatrix[i, j - 1] + 1;  
               int substitution = (i == 0 || j == 0) ? cost : scratchDistanceMatrix[i - 1, j - 1] + cost;  
   
                     scratchDistanceMatrix[i, j] = Math.Min(Math.Min(deletion, insertion), substitution);  
   
                     // Check for Transposition  
  if (i > 1 && j > 1 && (str1[str1Index] == str2[str2Index - 1]) && (str1[str1Index - 1] == str2[str2Index]))  
                     {  
scratchDistanceMatrix[i, j] = Math.Min(scratchDistanceMatrix[i, j], scratchDistanceMatrix[i - 2, j - 2] + cost);  
                    }  
                }  
            }  
 
            // Levenshtein distance is the bottom right element  
            return scratchDistanceMatrix[str1.Length, str2.Length];  
        }  
 
    }  
}

这里只用了最简单的 Parallel.For 方法，代码很简单和随意，但是看看效果还是可以的。

测试机找了不少，喜欢硬件的朋友兴许也能找到你感兴趣的:P

Intel Core i7 920 (4物理核心8逻辑核心，2.66G) + DDR3 1600 @ 7-7-7-24

AMD Athlon II X4 630 (4物理核心，2.8G) + DDR3 1600 @ 8-8-8-24

AMD Athlon II X2 240 (2物理核心，2.8G) + DDR2 667

Intel Core E5300 (2物理核心，2.33G) + DDR2 667

Intel Atom N270 (1物理核心2逻辑核心，1.6G) + DDR2 667

还在VM workstation里跑过，分别VM了XP和WIN7，都跑在上述i7的机器里，各自分配了2个核心。

程序设置每个字符串长1000个字符，共1000条字符串。

每个机器上程序都跑了3遍，取平均成绩，得到下表：

CPU	Core	Time_Sequential(s)	Time_Parallel(s)	S/P(%)
Intel Core i7 920	4 Cores, 8 Threads, 2.6G	55.132634	14.645687	376.44%
AMD AthlonII X4 630	4 Cores, 4 Threads, 2.8G	58.10592	17.152494	338.76%
AMD AthlonII X2 240	2 Cores, 2 Threads, 2.8G	66.159735	32.293972	204.87%
Intel E5300	2 Cores, 2 Threads, 2.3G	70.827157	38.50654	183.94%
Intel Atom N270	1 Cores, 2 Threads, 1.6G	208.47852	157.27869	132.55%
VMWin7(2 logic core)	2 Cores, 2 Threads	56.965068	33.069084	172.26%
VMXP(2 logic core)	2 Cores, 2 Threads	59.799399	35.35805	169.13%

可见，在多核心处理器上，并行计算的执行速度都得到了大幅提升，即便是在单核心超线程出2个逻辑核的Atom N270上亦缩短了32.55%的运行时间。在A240上并行计算的效率竟然是顺序计算的204.87% ？！而同样是4核心，i7 920在超线程的帮助下，并行执行效率提升明显高过A630。最后VM里的测试，是否也可以在某种程度上佐证在多核心的调度上，Win7要强过XP呢（纯猜测）？顺带可以看到，同样是i7的硬件环境，单线程宿主OS(Win7)里执行花费55.133秒，VM(Win7)里56.965秒，速度上有约3%的差异。

另外，针对性能较强的i7处理器，加大程序中的2个变量后再做测试，并行执行的效率比得到了进一步的提升。应该是因为创建/管理/销毁多线程的开销被进一步的摊平的缘故。例如在每字符串2000个字符，共2000条字符串的情况下，顺序执行和并行执行的时间分别是07:20.9679066和01:47.7059225，消耗时间比达到了409.42%。

来几张截图：

从截图中可以发现，这段测试程序在顺序执行的部分，内存占用相对平稳，CPU则大部分核心处在比较空闲的状态。到了并行执行部分，所有核心都如预期般被调动起来，同时内存占用开始出现明显波动。附图是在每字符串2000个字符，共2000条字符串的情况下得到的。

这里只是非常局部和简单的一个测试，目的是希望能带来一个直观的体验。微软已经提供了一组不错的例子可供参考 Samples for Parallel Programming with the .NET Framework 4

原文标题：小试 .NET 4.0 之并行计算

链接：http://www.cnblogs.com/Elvin/archive/2010/04/20/1716258.html

【编辑推荐】

浅尝.NET 4并行计算 效率已大幅提升

浅尝.NET 4并行计算效率已大幅提升