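[51CTO Exclusive Interview] Have you heard of Pinterest? If not, you are behind the times. Pinterest is where the waterfall-style image layout originated, and it was founded in March 2010. For the first nine months it did not attract much attention, but from late 2010 its growth became unstoppable, and by the middle of this year it had overtaken LinkedIn to become the third-largest social network in the United States (behind Facebook and Twitter). In August this year, Pinterest switched from invitation-only sign-up to open registration and launched mobile apps, preparing for its next wave of new users.
At the ArchSummit conference in August, two engineers from Pinterest gave a talk on the history of scaling Pinterest's architecture: Marty Weiner and Evrhet Milam. Marty joined Pinterest in January 2011 and Evrhet in November 2011. At a startup this young, no engineer has been around for very long, yet I could sense their strong feeling of belonging to and identifying with Pinterest.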
Marty Weiner & Evrhet Milam
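The title on Marty's business card is an interesting one: Cloud Ninja. Our interview with the two of them starts with what that title means...
Below, S is the 51CTO editor, M is Marty, and E is Evrhet.
S: Hello Marty and Evrhet, thanks for taking our interview. First, Marty, your business card carries the title Cloud Ninja. That's a cool title! Does it have anything to do with cloud computing?
M: Yes. All of our work is controlled in the cloud. And the title just sounds fun: when problems are popping up everywhere and you are jumping from one to the next, it feels a lot like being a ninja.
S: So you know architecture, DevOps, DBA work, security, and all the other skills needed to keep Pinterest running. But you started out as a programmer. Where did you learn all of those skills?
M: On the job. Pretty much all of us did. I used to work on compilers and did not know much about scaling for the web. I did do scaling in another sense, such as making Java run faster, but that is a different field, at a different level from what I do now. So for now each of us does a bit of everything: architecture, DevOps, databases, product design... We will probably specialize more once we have more engineers. Evr, do you want to talk about what you do?
E: My area is engagement growth, which means caring about whether users are using our product and whether they keep using it. So I focus more on features, on the things that make users love the product, so that they keep finding delight in it.
S: So, more on the product side?
E: Yes. That covers a lot of things, such as improving the quality and relevance of recommendations and improving the interface so that a user's homefeed looks great.
M: At the same time, everyone keeps an eye on the state of the site, so that when it breaks we can quickly restore it from backup.
S: So you don't have a product manager?
M: No "official" product manager yet. We will probably have one soon, though.
S: So the engineers do everything?
M: Oh, we have a founder who focuses on the vision for the whole product. We have designers. We have people focused on the business side. We have colleagues focused on internationalization. We all talk with each other, and they tell us which directions the engineering should take. In a sense, we do have people who fill the product management role.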
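S: So how does the engineering team work? How do you divide the work?
E: We currently have around 30 engineers, and we split them into many small teams, more and more of them. At first we had an integrity team and a growth team, and then some smaller teams such as the spam team, more recently a feed quality team, and a platform team, which is responsible for API development and for working with partners and third-party apps. Basically it is a lot of small teams focused on different features and products. Oh, and we have a mobile team too.
M: And when we hire engineers, maybe only 30% of what we look at is technical ability, whether the person is a good engineer; another very important 30% is cultural: we want to be sure they can fit into the culture of the team as a whole. Are they passionate? Are they fun? Do they like to joke around? Are they keen to build and create things and make them run? If not, if they are just dull coders, they are not right for our team.
S: You do the interviews yourselves?
M: Yes. We want to pass the team culture on and make sure new people match and carry forward the culture we have. That is also an important part of staying productive.
S: That way of organizing a team feels quite unusual.
M: Yes. I think companies at the startup stage generally need to set a bar like this, to make sure the people you bring in are interesting and that everyone builds things together.
S: So does every member of the team know everything?
M: Most members know something about every area. They may not know MySQL in depth, but they know at least a little. We also bring in people who have worked on MySQL before, such as people with an ops background, so they can talk to each other. And those ops people need to understand code too, because when something breaks they have to be able to patch it.
S: So which area would you say is your specialty?
M: I'm on the spam team at the moment, but in practice most of our work is like firefighting: wherever something breaks, we rush over and fix it. Internally my team is also called BlackUPS, which sounds cool, but what we actually do is step in and fix a problem when no specific team owns it. Keeping everything running smoothly requires us to know about every area.
E: Of course, we also have people focused on data and recommendations, who are more specialized than we are. They often hold PhDs in machine learning or data analysis. Everyone else is a generalist engineer who can do anything, from scaling servers to building new features. And in general the people we need are the ones who can do anything.
S: Interesting. So Marty, you worked on the JVM before coming to Pinterest. What about you, Evrhet?
E: I was at Yelp before. I was a generalist engineer there too, working on credit card security, data warehousing, and location data processing.
S: Do you see any differences between your previous jobs and your work at Pinterest now?
M: At Pinterest I have learned a lot about how to scale at large scale, and about how systems break at that scale. Here I get to see the whole system from a fairly high level. I have learned how to engineer large systems. JVM work has the same kind of problem, but the problems facing a JVM and a MySQL cluster are very different. At the compiler level, your system has only two states: it works or it doesn't. It is 1 or 0. With MySQL, you find that the system contains many technical components, and each one is in a 95% working state! So when you write something, you have to be very tolerant of that uncertainty, of those bugs. Sometimes components break and stop working while the system as a whole is still up, so part of our job is to make the system more tolerant, so that it keeps running under any adverse conditions. Of course, my compiler experience still helps a lot: at least I know how to debug ordinary problems, fix them, and get things working again.
E: Yelp was much the same: both run a website with a very large user base. The experience at the two places is similar, but I think one of the most important things I learned at Yelp was how to scale a team. You have to make sure the code is maintainable, has unit tests, and is documented, so that the rest of the team can carry on our work. Yelp did this very well, and I learned a lot from it, such as HR processes, how to bring in more engineers, and how to put the right people on the right work.
M: One thing I find interesting is that we did a lot of scaling. Pinterest scaled so fast that a lot of the code was immature. Some of it was downright ugly. If you don't slow down appropriately, it only gets worse. So some of the engineers we hired are specifically responsible for stabilizing things: they rewrite large amounts of code and clean it up, to get ready for the next round of scaling.
S: Sounds good. So making sure you have a good team is very important.
M: Yes. We had 5 engineers in January and now we have 30, so the team has grown. Actually, a lot of things depend not on your size but on how fast you are moving. When you go from 6 people to 15, or 15 to 60, or 100 to 150, or any time you double the team, you have to redefine everything. We have to pass the culture on, and we have to make sure the team stays tightly knit. We have to make sure their code does not move too fast, but at the same time we do not want to set too many rules, because that takes away the fun and the feeling that we are getting something awesome done. Big companies often have this problem: you want to change something? Get three levels of management to sign off first! Right now we are still fast, but not overly loose. Keeping it fun matters.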
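S: Great. You have both been engineers for a long time. Do you feel a lot has changed in the past two or three years?
M: For me, things have definitely been moving very fast.
S: Do you think that makes things harder for developers?
M: How should I put it: the people we bring in, the ones we think are the right fit, like a fast-changing environment. A fast pace makes them happy. A lot of our people escaped from big companies that they felt were too slow, and they wanted to get back to a fast-changing environment.
S: They want fast?
M: Yes, they want to work at a fast-moving startup like ours. I have met a lot of people who say, "I'm sick of all the management! I want to write code, ship a product, and write more code. I want to ship one product after another." Pinterest is still young enough to offer that opportunity. So speed is what we go after.
E: Yes, absolutely, that is one of the reasons I joined Pinterest. There are so many opportunities to build new products here. Looking back, my first two weeks at Pinterest were probably the most stressful two weeks of my life. Problems kept coming up one after another, and you were running around without a moment to stop. I never went through anything like that in my two years at Yelp; it really is very different. You are stressed and constantly solving problems, but it is also hugely satisfying. In the end you think, oh, this is really fun! The people at Pinterest are great. We are like one big family; we even sleep and hang out together. I enjoy it a lot.
M: One thing we do a lot is go to the dashboard and look at the curve of our user growth. No matter what has happened, whether you worked until 3 a.m. or pulled an all-nighter every week or two, as soon as you see that curve you feel: it's worth it, it's worth it! And the product itself is a really fun product. Pinterest is beautiful: we have very good designers and a good vision, and it is very simple. A lot of people love Pinterest. I know, because I like going on Twitter and reading all the tweets about Pinterest. "I love Pinterest to death." Or people at the end of a stressful day's work saying, "OK, stressful day, but at least I still have Pinterest~" Reading those is really satisfying. Their relaxation eases my stress, because I know my stress is bringing relaxation to a lot of people!
S: Wonderful. Last question, then. Your talk covered why Pinterest chose sharding over clustering when scaling. How was that decision made? What was the process like: did you just do some research and comparison and then decide, or did you run tests and data analysis?
M: Well, at the beginning, when there were only 2 or 3 of us, a lot of the time it really was "OK, we hope this works," and we just did it. We did not have the means to test and weed out the unsuitable technologies one by one. Once we knew that an approach was workable, we would split it out and go with whatever we thought might give us the best result. That it ended up working involved some luck. Of course, we now have the capacity to do some testing, but more often we still just build the system, push it to some of the servers, see whether any problems show up, and then decide what to do next. Take our follower service: at first we pushed it to 10% of users, watched how it ran, found some problems, and fixed them. Then we pushed to 20%, new problems showed up, we fixed those, and moved on. So we test directly in production like this until we reach 100%, and then it is done. It is not hard to control. You can push simulated load at the system to test it and watch how it responds, but simulated load is never the same as real load. Never.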
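The staged rollout Marty describes (10% of users, then 20%, and so on up to 100%) can be sketched roughly as follows. This is only an illustration under stated assumptions, not Pinterest's actual code: the function name `in_rollout`, the feature name `follower_service`, and the choice of a SHA-256 hash to place users into 100 buckets are all hypothetical.

```python
import hashlib


def in_rollout(user_id: int, feature: str, percent: int) -> bool:
    """Return True if the user is inside the first `percent` of 100 buckets.

    Hypothetical helper: hashing the feature name together with the user ID
    gives each user a stable bucket in [0, 100), so a user who is included
    at 10% is still included at 20%, 50%, and 100%.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


if __name__ == "__main__":
    users = range(10_000)
    # Widen the rollout step by step, as in the interview: 10%, 20%, ..., 100%.
    for percent in (10, 20, 50, 100):
        enabled = sum(in_rollout(u, "follower_service", percent) for u in users)
        print(f"{percent:3d}% rollout -> {enabled} of {len(users)} users enabled")
```

Because the bucket depends only on the user ID and the feature name, raising the percentage only ever adds users. Nobody who already has the new service is switched back, which keeps each step comparable with the previous one while problems are being fixed.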
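E: It also depends on your team's atmosphere. When new ideas come up, the engineers discuss them. Are we excited about this new idea? Good ideas tend to spark more discussion, and engineers start trying to bring them in, test them, and deploy them. In short, it is the engineers' excitement about new ideas that brings good solutions to the problems we run into.
S: Do you have leaders?
M: We have a Head of Engineers. Then every small team has a team lead, though the lead's main job is not managing but coding along with the team and leading it. And of course there are our founders. The two founders are by default above everyone.
S: OK. Thank you very much, Marty and Evrhet, for taking our interview!
The English transcript of the interview follows.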
S: Hi Marty, Evrhet, thanks for taking our interview!
You call yourself a cloud ninja. Does this have anything to do with cloud computing?
M: Yes. When we do our work, the control system is all in the cloud. And it just sounds funny. When there are lots of problems going on, you have to be a little bit ninja.
S: You know architecture, DevOps, DBA work, security, and everything else needed to keep Pinterest running. And you started off as a coder. When did you learn all these things?
M: On the job. Both of us. I was a compiler guy, and I didn't do scaling for the web. I did big scaling on the other side of the game, like making Java run faster, but that's on the other side. I didn't do things quite on that level. So that's the way we are going. We do a little bit of everything: architecture, DevOps, DB work, product features... Eventually we'll get segmented as we get more engineers. I do spam. Evr, how about you talk about what you do?
E: I'm focused on engagement growth, so trying to make sure more users are using our product and continue to use it, day by day. So just features, making people enjoy the product, helping them find content they are interested in.
S: So a lot on the product side?
E: Mine is a lot more product focused, so we do things to make sure you have good recommendations, that they are pretty to look at, and that users are happy when they see their homefeed.
M: Everyone still keeps an eye on the website, so that if it breaks we can get a backup and get it back running.
S: Do you have a product manager?
M: Not yet. No official individual product manager yet, but I think we'll get one.
S: So just the engineers doing everything?
M: Oh, we have a founder who is focused on the vision. We have some great designers, and people focused on the business. We have people focused on internationalization, who help us understand what we need to engineer for that. So we have product management in an implicit way, but not a specific product manager yet.
S: So what is the engineering team at Pinterest like? How do you distribute work?
E: I think we have over 30 engineers, and we are splitting them into more and more teams. We originally had an integrity team and a growth team, and then we had even smaller teams, like the spam team, recently a team related to feed quality, and a platform team that just worries about the APIs and making the platform available for partners and third-party apps. So we just have a lot of small teams flying around the different features and products we have. And we have a mobile team, too.
M: And in terms of the type of people we look for, 30% of our interview is on technical aspects, to make sure they are good engineers, but 30% is on culture: how well they can fit into our culture. Are they motivated, are they fun, do they like to joke around? Do they want to build things, make them run, and create? If they are not like that, if they can only write code but are not fun, we won't hire them.
S: So you interview them?
M: Yes, we all interview the new people coming in. That's the culture we want going forward, so we make sure the people coming in match the current culture. I think it's productive.
S: That's quite unique.
M: Yes, I think that barrier is common at the early stage of companies. Make sure the people you bring in are having lots of fun. Build things.
S: So do all the members of the teams know everything?
M: Most people know a little bit about every piece, but they may not know MySQL day to day; we are a little bit segmented. A lot of the people we hire have just done different pieces, and some of them know MySQL well if that's what they previously did. But the people who know MySQL are most probably the ops guys, and they need to know code, need to know how to fix things when they break.
S: Are you regarded as specialized in certain fields, then?
M: I'm working on spam right now, but we tend to jump on whatever needs to be fixed now and then. I'm actually on the team that we call "BlackUPS". It sounds cool, but what it does is, if there is something broken and there is no specific team to fix it, we'll go and fix it. We are the people that make sure everything is fluid and working. So this role requires us to know a little bit about everything.
E: We do have some people who are focused on data and recommendations; they are a bit more specialized than the rest of us. They have PhD-level knowledge of machine learning or things like that, or analytical knowledge. Most of us are generalist engineers who can deal with scaling servers, adding a feature, and things like that. Usually we need people who can jump onto anything.
S: That's interesting. So Marty, before you joined Pinterest you were doing JVM work. How about Evrhet?
E: I was at Yelp, as a generalist engineer doing credit card protection, data warehousing, and locations.
S: So what difference do you think there is between working at Pinterest and working in your previous post?
M: I think I really learned how to scale things on a larger scale and how things break down, and I get to see things from a top level. I really learned how to engineer very large systems. It's the same problem, but also a very different problem from scaling MySQL. At the compiler level, your system(?) either works or doesn't work; it's 1 or 0. With MySQL, you find yourself surrounded by technology that works 95%. You have to write things that are very tolerant of that, tolerant of the bugs. Some of them don't work, some of them break away, so part of our job is to make things tolerant: make them work no matter what is happening. I think the compiler work helped me a lot, just for debugging regular problems, beating them down to the ground to make them work.
E: Yelp does pretty much the same thing, running a website that has a lot of users, so a lot of my experience there is similar to what I have now. But I think Yelp has a lot of knowledge about scaling a team: making sure the code is maintainable, has unit tests, and is documented, so that the rest of the team can carry on with the code. Yelp did a good job of that, and I learned a lot of these processes, like HR processes, getting more engineers, and getting the right people for the job.
M: One thing I think is interesting is that it was just two of us for the wild scaling. When Pinterest is scaling fast, a lot of code is not mature. Some code there is just really nasty, and if you don't slow down a bit, it will just become worse and worse. So for some engineers we brought in, their work is to stabilize: they come in and rework a lot of the code to make it cleaner and more scalable for later.
S: Cool. So the important thing is to get the right team.
M: Yes. In January we had 5 engineers, and now we have 30, so we are getting bigger. It actually isn't about what scale you are at, but about how fast things move. When you move from 6 to 15 engineers, or 15 to 60, or 100 to 150, or any doubling of the size, you have to rework everything that you used to do. So we have to make sure the culture matches, and we have to make sure our team stays properly connected. We have to make sure their coding doesn't fly too fast, but at the same time we don't want to place too many rules around the place; that will slow down the fun, slow down the feeling that you are getting something done. Bigger companies have this problem: I want to make a change, and I have to go up three levels for approval. Right now we are still fast, but not super loose. Keep it fun.
S: Cool. So you have been engineers for quite a long time. Do you think things have changed a lot in these 2-3 years?
M: It certainly is moving fast for me.
S: And do you think this has made things more difficult for developers?
M: I think we get the right people, who like things to be fast moving. It makes things better for them. We get a lot of people who come from big, big companies that have slowed down, and they want to come back to the fast.
S: So they want the fast changes?
M: Yes. They want to be in small startups that move fast. We've got a lot of people who say, "I'm sick of all the management. I want to make code and move. I want to make another product and move." Pinterest is still young enough that they can get that opportunity. So fast is what we thrive on.
E: Oh yes, that's one of the reasons I joined Pinterest. There are so many opportunities to work on new products. Right after I joined, it was probably the most stressful two weeks of my life. Things were breaking all the time, and you kept going back into fixing mode. I never experienced that in my two years at Yelp. It's really different. It can be stressful to keep solving problems, but it is really satisfying, and very, very fun at the end of the day. So at Pinterest we have all the right people, like a family; we sleep and hang out together. I really enjoy it.
M: One thing is that we kept watching that graph of the growing number of users, which keeps rising. So no matter what happens, whether we worked until 3 a.m. or even all night every week or every other week, as long as you saw that graph, it was like, oh, it's worth it, it's worth it~ And also it's a fun product to build. It's a really beautiful product: we have really good designers, visionaries, and it's really simple. A lot of people have so much fun with it, and I like going on Twitter and reading all these people saying "I love Pinterest". A lot of people go on at the end of stressful days and say, oh, at least I've got my Pinterest and my key. Reading those is really satisfying. You know a lot of people are relaxing, you know, my stress! At least my stress is going to relax somebody :)
S: Cool. So last question. Your session mentioned choosing sharding over clustering. So who makes those decisions? What's the process like? Do you just go ahead and decide, or do you do some data analysis and testing before you go for it?
M: So early on, when there were only 2 or 3 of us, it was pretty much like "we hope this will work", because we just didn't have the room to test and bring up a lot of technologies that didn't work. We knew what was working, so we sort of extracted it and hoped for the best with what we got. And in a way we got lucky, as it turned out to work. Now we do have the capacity to do a bit of testing, but we still kind of build the system, start pushing it to a little bit of entity(?), and see how it does, and if there are problems, fix them. For the follower service we started off pushing to 10% and saw, OK, we've found a few problems, fix those. Push to 20%, OK, a few more problems, and so on. So we bring things to production this way, but just push some things to it. And when it's at 100%, you are done. So in a way we kind of test in production, but in a controlled way. And you can always push load at the system and see how the system responds, but it never ever matches the real load.
E: Also, I think it's up to the team you have. When new ideas come up, engineers are going to talk about them. Are they excited about it? Good ideas tend to generate lots of chat, and people will start bringing them in, testing them, deploying them. They have to be excited about the new idea; that is what gives good solutions to the problems we have.
S: So do you have some kind of a leader?
M: We have a head of engineers. And then every team has a team leader who is very down in the code: he is not managing, just coding along and leading the team. And we have a founder, two founders, who are automatically on top of everything.