How to Replace a Query Field in a URL


When writing a crawler, we sometimes need to build a new URL from the current one. Take the following pseudocode as an example:

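import re

current_url = 'https://www.kingname.info/archives/page/2/'
# Extract the first run of digits that follows a slash (the page number)
current_page = re.search(r'/(\d+)', current_url).group(1)
next_page = int(current_page) + 1
# Replace that same "/<digits>" segment with the next page number
next_url = re.sub(r'/\d+', f'/{next_page}', current_url, count=1)
# make_request is a placeholder for whatever function sends the request
make_request(next_url)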

For this URL, next_url comes out as https://www.kingname.info/archives/page/3/.

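But sometimes the paging parameter is not a number. For example, on some sites you request a URL like this: https://xxx.com/articlelist?category=technology&after=asdrtJKSAZFD

When you request this URL, it returns a JSON string that contains the following fields: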

... 
"paging": { 
        "cursors": { 
            "before": "MTA3NDU0NDExNDEzNTgz", 
            "after": "MTE4OTc5MjU0NDQ4NTkwMgZDZD" 
        }, 
         
    } 
...

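This is common on feed-style sites, where you can only keep scrolling down to load the next page and cannot jump straight to a given page number. Each response carries the after parameter for the next page. To fetch the next page, you replace the value that follows after= in the current URL with this parameter.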
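As a rough sketch of the first half of that step, the cursor can be pulled out of the response with the standard json module. The response body below is a trimmed-down, hypothetical version of the JSON shown above, and extract_after_cursor is a helper name invented here for illustration.

```python
import json

# Hypothetical, trimmed-down response body modeled on the JSON shown above
response_text = '''
{
    "paging": {
        "cursors": {
            "before": "MTA3NDU0NDExNDEzNTgz",
            "after": "MTE4OTc5MjU0NDQ4NTkwMgZDZD"
        }
    }
}
'''


def extract_after_cursor(text):
    """Return the `after` cursor from the response, or None if it is absent."""
    data = json.loads(text)
    # Chain .get() calls so a missing level yields None instead of a KeyError
    return data.get('paging', {}).get('cursors', {}).get('after')


after = extract_after_cursor(response_text)
print(after)
```

Returning None when the cursor is missing gives the crawler a simple way to detect that there may be no further page, although how a real site signals the last page is an assumption that has to be checked per site.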

This makes replacing a parameter in the URL less simple than it sounds, because the URL can take four forms:

First page, no after parameter: https://xxx.com/articlelist?category=technology
First page, after parameter present but with no value: https://xxx.com/articlelist?category=technology&after=
Later page, nothing following the after value: https://xxx.com/articlelist?category=technology&after=asdrtJKSAZFD
Later page, more parameters following the after value: https://xxx.com/articlelist?category=technology&after=asdrtJKSAZFD&other=abc

Try it yourself: using regular expressions, how would you cover all four cases and generate the URL of the next page?
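For comparison, here is one possible regex-only attempt. It is only a sketch: replace_after_with_regex is a hypothetical helper, it hard-codes the parameter name after, it does not URL-encode the new value, and it assumes the URL has no #fragment when the parameter has to be appended.

```python
import re


def replace_after_with_regex(url, value):
    """Hypothetical regex-only helper covering the four URL forms listed above."""
    # Match "after=" only when it starts a parameter, i.e. right after ? or &
    pattern = r'([?&])after=[^&#]*'
    if re.search(pattern, url):
        # A function replacement avoids backslash handling in the new value
        return re.sub(pattern, lambda m: f'{m.group(1)}after={value}', url, count=1)
    # Parameter is missing: append it with the right separator
    separator = '&' if '?' in url else '?'
    return f'{url}{separator}after={value}'


urls = [
    'https://xxx.com/articlelist?category=technology',
    'https://xxx.com/articlelist?category=technology&after=',
    'https://xxx.com/articlelist?category=technology&after=asdrtJKSAZFD',
    'https://xxx.com/articlelist?category=technology&after=asdrtJKSAZFD&other=abc',
]
for url in urls:
    print(replace_after_with_regex(url, '0000000'))
```

Even this short version needs a separate branch for the missing-parameter case and leaves encoding and fragments unhandled, which is the kind of bookkeeping a URL parser takes care of.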

In fact, we don't need regular expressions at all. Python's built-in urllib module already provides a solution to this problem. Let's look at some code first:

from urllib.parse import urlparse, urlunparse, parse_qs, urlencode 
 
 
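def replace_field(url, name, value):
    # Split the URL into scheme, netloc, path, params, query and fragment
    parse = urlparse(url)
    query = parse.query
    # Turn the query string into a dict of lists; blank values are dropped by default
    query_pair = parse_qs(query)
    # Add the field, or overwrite it if it is already present
    query_pair[name] = value
    # doseq=True expands each list into repeated name=value pairs
    new_query = urlencode(query_pair, doseq=True)
    # ParseResult is a namedtuple, so _replace returns a copy with the new query
    new_parse = parse._replace(query=new_query)
    next_page = urlunparse(new_parse)
    return next_page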
 
url_list = [ 
    'https://xxx.com/articlelist?category=technology', 
    'https://xxx.com/articlelist?category=technology&after=', 
    'https://xxx.com/articlelist?category=technology&after=asdrtJKSAZFD', 
    'https://xxx.com/articlelist?category=technology&after=asdrtJKSAZFD&other=abc' 
] 
 
for url in url_list: 
    next_page = replace_field(url, 'after', '0000000') 
    print(next_page)
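
To make the individual steps easier to follow, the sketch below walks one of the URLs above through the same urllib.parse calls and shows each intermediate value. The variable names parts, pairs, and new_url are chosen here for illustration only.

```python
from urllib.parse import urlparse, urlunparse, parse_qs, urlencode

url = 'https://xxx.com/articlelist?category=technology&after=asdrtJKSAZFD&other=abc'

parts = urlparse(url)
print(parts.query)
# category=technology&after=asdrtJKSAZFD&other=abc

pairs = parse_qs(parts.query)
print(pairs)
# {'category': ['technology'], 'after': ['asdrtJKSAZFD'], 'other': ['abc']}

# Overwriting an existing key keeps its position in the dict
pairs['after'] = '0000000'
new_query = urlencode(pairs, doseq=True)
print(new_query)
# category=technology&after=0000000&other=abc

new_url = urlunparse(parts._replace(query=new_query))
print(new_url)
# https://xxx.com/articlelist?category=technology&after=0000000&other=abc

# parse_qs drops parameters with empty values unless keep_blank_values=True
print(parse_qs('category=technology&after='))
# {'category': ['technology']}
print(parse_qs('category=technology&after=', keep_blank_values=True))
# {'category': ['technology'], 'after': ['']}
```

Because a blank after= is dropped during parsing and then added back by the assignment, the first three URLs in url_list all produce https://xxx.com/articlelist?category=technology&after=0000000, and the fourth keeps other=abc after the replaced value.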