本文转载自微信公众号「数师兄」,作者数师兄。转载本文请联系数师兄公众号。
在上期的SQL题目中,我们以求连续活跃为例,讲述了lead 、lag错位相减的用法。但若不采用错位相减的话,其实也可以实现这一目的。本次我们就先讲述一下最大连续活跃天数的第二种解法,然后再讲述另一常见面试SQL题目。
这一方法就是利用row_number给每一行生成一个连续的序号,这样,在连续活跃的时间段,就有了一个对照值,利用日期和行号这两者就可以得到一个差值,在连续活跃的时段内,其差值是相同的,那么我们按照这个差值进行分组就可以进一步计算本次连续活跃有多少天了。
SQL如下
- WITH active AS
- (SELECT 100 UID,
- '2021-04-01'dt
- UNION ALL SELECT 101 UID,
- '2021-04-01'dt
- UNION ALL SELECT 102 UID,
- '2021-04-01'dt
- UNION ALL SELECT 103 UID,
- '2021-04-01'dt
- UNION ALL SELECT 100 UID,
- '2021-04-02'dt
- UNION ALL SELECT 101 UID,
- '2021-04-02'dt
- UNION ALL SELECT 102 UID,
- '2021-04-02'dt
- UNION ALL SELECT 103 UID,
- '2021-04-02'dt
- UNION ALL SELECT 104 UID,
- '2021-04-02'dt
- UNION ALL SELECT 100 UID,
- '2021-04-03'dt
- UNION ALL SELECT 104 UID,
- '2021-04-03'dt
- UNION ALL SELECT 101 UID,
- '2021-04-04'dt
- UNION ALL SELECT 102 UID,
- '2021-04-04'dt
- UNION ALL SELECT 103 UID,
- '2021-04-04'dt
- UNION ALL SELECT 104 UID,
- '2021-04-04'dt
- UNION ALL SELECT 105 UID,
- '2021-04-04'dt
- UNION ALL SELECT 102 UID,
- '2021-04-03'dt)
- SELECT UID,
- max(days)days
- FROM
- (SELECT date_sub(dt,rn),
- UID,
- count(DISTINCT dt)days
- FROM
- (SELECT UID,
- dt,
- row_number()over(PARTITION BY UID
- ORDER BY dt)rn
- FROM active)x
- GROUP BY date_sub(dt,rn),
- UID)y
- GROUP BY UID
结果如下:
至此,连续活跃的问题就告一段落了。下面我们来看另一个常见问题:
2.行列转换
(1)某直播产品,直播记录表lives中有liveid(直播id)、invite_list(邀约名单,string,用户id以“,”分割)字段,直播观看记录表record中有uid(用户id)、liveid(直播id)、duration(观看时长),试求出邀约名单中所有用户的观看时长
由于存在邀约用户未进房观看的问题,所以我们直接从观看记录中来取,不一定能将名单中所有用户包含进去,所以,我们需要以直播记录中的邀约名单为准。这就需要利用explode和lateral view来将这个string类型转换为行记录的形式。
仍然是使用with构造lives和record两个简单的临时表,详细SQL如下:
- with lives as(
- select 100 liveid,'A01,A02,A03,A04'invite_list union all
- select 101 liveid,'A05,A06,A07,A08'invite_list
- ),
- record as(
- select 100 liveid,'A01' uid,30 duration union all
- select 100 liveid,'A02' uid,50 duration union all
- select 100 liveid,'A03' uid,15 duration union all
- select 101 liveid,'A07' uid,20 duration union all
- select 101 liveid,'A08' uid,60 duration
- )
- select
- list.liveid,list.ulist,if(record.duration is null,0,record.duration)duration
- from (
- select liveid,ulist
- from lives
- lateral view explode(split(invite_list,',')) uid as ulist
- )list left join record on list.liveid=record.liveid and list.ulist=record.uid
这里简单说明下:lateral view 是为原表调用explode函数,将一个array或者map的字段分解成多行并作为一个临时表,然后再与其他字段组合到一起。
(2)某APP用户访问页面记录record,含有字段uid(用户id)、pageid(页面id),将每个用户的访问路径输出在一个字段中
前一道题目是列转行,这道题是行专列。利用collect_set可以将分组下面某个字段的所有记录聚合成一个列表,然后可以进一步使用concat_ws函数把array转为字符串,并用“>”将其连接起来。
- WITH record as
- (SELECT 100 UID,'A01' pageid
- UNION ALL SELECT 100 UID,'A02' pageid
- UNION ALL SELECT 100 UID,'A03' pageid
- UNION ALL SELECT 101 UID,'A07' pageid
- UNION ALL SELECT 101 UID,'A08' pageid)
- SELECT UID,
- concat_ws('>',collect_list(pageid))page
- FROM record
- GROUP BY UID
3.求留存
对于多数互联网公司来说,任何时候留存率都是重中之重。所以在数据分析的面试当中,求留存率的SQL题目出现的频率也是极高。
所谓“留存率”,通常是定义一个初始行为与一个后继行为,符合初始条件的用户量在经过N天之后,剩余的比例就是留存率。发散来讲,回访、复看等概念也与留存相类似。如:
某APP新注册用户表user,含有字段uid(用户id)、dt(注册日期),活跃表active,有字段uid(用户id),dt(活跃日期),求每天新增用户的次日留存率、次2日留存率、次3日留存率……次7日留存率
此题目就是以新注册作为初始行为,活跃作为后继行为。显而易见地,我们可以将初始行为与后继行为做左连接,然后按注册日期进行聚合,求相应的新增用户和次N日留存人数。
- WITH user as
- (SELECT 100 UID,'2021-01-01' dt
- UNION ALL SELECT 101 UID,'2021-01-01' dt
- UNION ALL SELECT 102 UID,'2021-01-01' dt
- UNION ALL SELECT 103 UID,'2021-01-02' dt
- UNION ALL SELECT 104 UID,'2021-01-02' dt
- UNION ALL SELECT 105 UID,'2021-01-02' dt
- UNION ALL SELECT 106 UID,'2021-01-02' dt),
- active as(
- select 100 UID,'2021-01-01'dt union all
- select 100 UID,'2021-01-02'dt union all
- select 100 UID,'2021-01-03'dt union all
- select 100 UID,'2021-01-05'dt union all
- select 100 UID,'2021-01-07'dt union all
- select 101 UID,'2021-01-01'dt union all
- select 101 UID,'2021-01-07'dt union all
- select 102 UID,'2021-01-01'dt union all
- select 103 UID,'2021-01-01'dt union all
- select 103 UID,'2021-01-02'dt union all
- select 103 UID,'2021-01-03'dt union all
- select 103 UID,'2021-01-05'dt union all
- select 104 UID,'2021-01-02'dt union all
- select 104 UID,'2021-01-03'dt union all
- select 104 UID,'2021-01-04'dt union all
- select 105 UID,'2021-01-02'dt union all
- select 105 UID,'2021-01-03'dt union all
- select 105 UID,'2021-01-04'dt union all
- select 105 UID,'2021-01-05'dt union all
- select 105 UID,'2021-01-07'dt union all
- select 106 UID,'2021-01-02'dt union all
- select 106 UID,'2021-01-04'dt
- )
- select
- user.dt,
- count(distinct user.uid)new_user,
- count(distinct if(datediff(active.dt,user.dt)=1,active.uid,null))retain1,
- count(distinct if(datediff(active.dt,user.dt)=2,active.uid,null))retain2,
- count(distinct if(datediff(active.dt,user.dt)=3,active.uid,null))retain3,
- count(distinct if(datediff(active.dt,user.dt)=4,active.uid,null))retain4,
- count(distinct if(datediff(active.dt,user.dt)=5,active.uid,null))retain5,
- count(distinct if(datediff(active.dt,user.dt)=6,active.uid,null))retain6,
- count(distinct if(datediff(active.dt,user.dt)=7,active.uid,null))retain7
- from user left join active on user.uid=active.uid
- group by user.dt
其结果如下:
这种方法可能比较麻烦一些,得把要求的每一个留存指标都要单独写一个字段。若想省事儿,可以单独求出每日的新增数量,再将新增表与活跃表连接而求出每个注册日期在后续每天的活跃数量,然后将新增数量与活跃数量再进行关联,从而求得任意周期的留存。