Mean and variance with map-reduce
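The formulas for the mean and variance of a set of numbers are familiar to most readers, but how should the map and reduce functions be designed? A good starting point is the formulas themselves. Given n numbers a1, a2, ..., an, the mean is m = (a1 + a2 + ... + an) / n and the variance is S = [(a1-m)^2 + (a2-m)^2 + ... + (an-m)^2] / n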
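Expanding the variance formula gives S = [(a1^2 + ... + an^2) + m^2*n - 2*m*(a1 + a2 + ... + an)] / n. Based on this, the map side takes (key, a1) as input and emits (1, (n1, sum1, var1)), where n1 is the count of numbers processed by that worker, sum1 is the sum of those numbers (e.g. a1+a2+a3...), and var1 is the sum of their squares (e.g. a1^2+a2^2+...). Every mapper emits the same key, 1, so that all partial results reach a single reducer.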
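As a quick sanity check on the expansion, here is a minimal sketch in plain Python (no map-reduce involved) that computes the variance both by definition and through the expanded form. The function names and the sample list are illustrative assumptions, not part of the original post.

```python
def variance_by_definition(values):
    """Population variance: mean of squared deviations from the mean."""
    n = len(values)
    m = sum(values) / n
    return sum((a - m) ** 2 for a in values) / n


def variance_expanded(values):
    """Same variance, using only n, the sum and the sum of squares."""
    n = len(values)
    total = sum(values)                    # a1 + a2 + ... + an
    sq_total = sum(a * a for a in values)  # a1^2 + a2^2 + ... + an^2
    m = total / n
    return (sq_total + m * m * n - 2 * m * total) / n


sample = [1.0, 2.0, 3.0, 4.0]  # hypothetical input
print(variance_by_definition(sample), variance_expanded(sample))
```

The expanded form matters because it needs only three running totals, which is what lets each worker summarise its share of the data independently.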
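After receiving these values, the reduce side adds up all the counts n1, n2, ... to get n and all the sums sum1, sum2, ... to get sum, so the mean is m = sum/n. It then adds up var1, var2, ... to get var, so the variance is S = (var + m^2*n - 2*m*sum) / n. Because sum = m*n, this is the same as var/n - m^2. The reducer outputs (1, (m, S)).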
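The map and reduce steps described above can be sketched without any framework. This is only an illustration under assumed names (`map_partial`, `reduce_partials`) with each "worker" modelled as a plain list of numbers; it is not the mrjob implementation.

```python
def map_partial(chunk):
    """One worker's summary: (n1, sum1, var1) = (count, sum, sum of squares)."""
    return (len(chunk), sum(chunk), sum(a * a for a in chunk))


def reduce_partials(partials):
    """Combine the per-worker triples into the overall (mean, variance)."""
    n = 0
    total = 0.0
    sq_total = 0.0
    for n_j, sum_j, var_j in partials:
        n += n_j
        total += sum_j
        sq_total += var_j
    m = total / n
    s = (sq_total + m * m * n - 2 * m * total) / n
    return (m, s)


# Hypothetical split of [1, 2, 3, 4] across two workers
chunks = [[1.0, 2.0], [3.0, 4.0]]
result = reduce_partials([map_partial(c) for c in chunks])
print(result)
```

How the data is split across workers does not change the result, since the three totals are plain sums.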
The code is an implementation based on mrjob (https://pythonhosted.org/mrjob/; see Machine Learning in Action, chapter 15)
from mrjob.job import MRJob
from mrjob.step import MRStep


class MRmean(MRJob):
    def __init__(self, *args, **kwargs):
        super(MRmean, self).__init__(*args, **kwargs)
        # Running totals kept by each mapper
        self.inCount = 0
        self.inSum = 0
        self.inSqSum = 0

    def map(self, key, val):  # needs exactly 2 arguments
        if False: yield  # emits nothing here, but makes map() a generator
        inVal = float(val)
        self.inCount += 1
        self.inSum += inVal          # running sum of the elements
        self.inSqSum += inVal*inVal  # running sum of the squared elements

    def map_final(self):
        if self.inCount == 0:  # this mapper received no input
            return
        mn = self.inSum/self.inCount
        mnSq = self.inSqSum/self.inCount
        # Mapper output. Note that it carries means rather than raw sums:
        # mn = sum1/n1 and mnSq = var1/n1
        yield (1, [self.inCount, mn, mnSq])

    def reduce(self, key, packedValues):
        cumVal = 0.0; cumSumSq = 0.0; cumN = 0.0
        for valArr in packedValues:  # parse the mappers' output
            nj = float(valArr[0])
            cumN += nj
            cumVal += nj*float(valArr[1])    # nj * mean recovers the sum
            cumSumSq += nj*float(valArr[2])  # nj * mean square recovers the sum of squares
        mean = cumVal/cumN
        var = (cumSumSq - 2*mean*cumVal + cumN*mean*mean)/cumN
        yield (mean, var)  # reducer output: mean and variance

    def steps(self):
        # MRStep replaces self.mr(), which newer mrjob releases no longer provide
        return [MRStep(mapper=self.map, mapper_final=self.map_final,
                       reducer=self.reduce)]


if __name__ == '__main__':
    MRmean.run()
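One detail of the listing differs from the description earlier: each mapper emits (count, mean, mean of squares) rather than the raw sums, and the reducer multiplies by the count to get the sums back. The sketch below mirrors that reducer arithmetic in plain Python so it can be checked by hand; `combine_normalised` and the sample triples are hypothetical names and data, not part of mrjob.

```python
def combine_normalised(triples):
    """Combine (count, mean, mean_of_squares) triples into (mean, variance)."""
    cum_n = 0.0
    cum_val = 0.0
    cum_sum_sq = 0.0
    for n_j, mean_j, mean_sq_j in triples:
        cum_n += n_j
        cum_val += n_j * mean_j        # count * mean = sum
        cum_sum_sq += n_j * mean_sq_j  # count * mean of squares = sum of squares
    mean = cum_val / cum_n
    var = (cum_sum_sq - 2 * mean * cum_val + cum_n * mean * mean) / cum_n
    return (mean, var)


# What two mappers would emit for the chunks [1, 2] and [3, 4]
mapper_outputs = [(2, 1.5, 2.5), (2, 3.5, 12.5)]
combined = combine_normalised(mapper_outputs)
print(combined)
```

Weighting by the count is what keeps the result correct when mappers see different amounts of data; averaging the per-mapper means directly would be wrong in that case.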