How to Write Python Code to Crawl a Website

Published: 2022-01-13 09:28:41 | Views: 211 | Author: iii | Category: Big Data

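This article shows how to write a website crawler in Python. The script below requests product reviews from JD.com page by page, appends each review to a text file, segments the collected text with jieba, and renders the result as a word cloud shaped by a mask image.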

import requests
import json
import os
import time
import random
import jieba
from wordcloud import WordCloud
from imageio import imread

comments_file_path = 'jd_comments.txt'


def get_jd_comments(page=0):
    # Fetch one page of JD product reviews.
    url = 'https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId=1340204&score=0&sortType=5&page=%s&pageSize=10&isShadowSku=0&fold=1' % page
    headers = {
        # The page the request claims to come from; this differs for every site.
        'referer': 'https://item.jd.com/1340204.html',
        # The user agent tells the site which browser is making the request.
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36',
        # Identifies the visitor (guest or member); a cookie from a logged-in session is recommended.
        'cookie': '__jdu=1766075400; areaId=27; PCSYCityID=CN_610000_610100_610113; shshshfpa=a9dc241f-78b8-f3e1-edab-09485009987f-1585747224; shshshfpb=dwWV9IhxtSce3DU0STB1%20TQ%3D%3D; jwotest_product=99; unpl=V2_ZzNtbRAAFhJ3DUJTfhFcUGIAE1RKU0ZCdQoWU3kQXgcwBxJdclRCFnQUR1FnGF8UZAMZWEpcRhFFCEdkeBBVAWMDE1VGZxBFLV0CFSNGF1wjU00zQwBBQHcJFF0uSgwDYgcaDhFTQEJ2XBVQL0oMDDdRFAhyZ0AVRQhHZHsfWwJmBRZYQ1ZzJXI4dmR9EFoAYjMTbUNnAUEpDURSeRhbSGcFFVpDUUcQdAl2VUsa; __jdv=76161171|baidu-pinzhuan|t_288551095_baidupinzhuan|cpc|0f3d30c8dba7459bb52f2eb5eba8ac7d_0_cfd63456491d4208954f13a63833f511|1585835385193; __jda=122270672.1766075400.1585747219.1585829967.1585835353.3; __jdc=122270672; 3AB9D23F7A4B3C9B=AXAFRBHRKYDEJAQ4SPJBVU4J4TI6OQHDFRDGI7ISQFUQGA6OZOQN52T3QYSRWPSIHTFRYRN2QEG7AMEV2JG6NT2DFM; shshshfp=03ed62977bfa44b85be24ef65fbd9b87; ipLoc-djd=27-2376-4343-53952; JSESSIONID=51895EFB4EBD95BA3B3ADAC8C6C73CD8.s1; shshshsID=d2435956e0c158fa7db1980c3053033d_15_1585836826172; __jdb=122270672.16.1766075400|3.1585835353'
    }
    try:
        response = requests.get(url, headers=headers, timeout=10)
    except requests.RequestException as exc:
        # Without a response there is nothing to parse, so skip this page.
        print('something wrong!', exc)
        return
    # The body is JSONP: drop the 20-character prefix "fetchJSON_comment98("
    # and the trailing ");" to leave plain JSON text.
    comments_json = response.text[20:-2]
    # Parse the JSON text into a Python object.
    comments_json_obj = json.loads(comments_json)
    # Take every entry under the "comments" key.
    comments_all = comments_json_obj['comments']
    # Open the file once per page and append each review on its own line.
    with open(comments_file_path, 'a+', encoding='utf-8') as fin:
        for comment in comments_all:
            fin.write(comment['content'] + '\n')
            print(comment['content'])


def batch_jd_comments():
    # Clear the output of any previous run before writing new data.
    if os.path.exists(comments_file_path):
        os.remove(comments_file_path)
    # Passing a page number fetches that specific page of reviews.
    for i in range(30):
        print('Fetching page ' + str(i + 1) + '...')
        get_jd_comments(i)
        # Sleep for a random interval (up to 5 seconds) to mimic a human visitor
        # and avoid an IP ban caused by requesting too frequently.
        time.sleep(random.random() * 5)


# Segment the collected reviews into words.
def cut_comments():
    with open(comments_file_path, encoding='utf-8') as file:
        comment_text = file.read()
    wordlist = jieba.lcut_for_search(comment_text)
    new_wordlist = ' '.join(wordlist)
    return new_wordlist


# Use the image byt.jpg as a mask so the word cloud takes the same shape.
def create_word_cloud():
    mask = imread('byt.jpg')
    # font_path must point to a font that contains Chinese glyphs;
    # msyh.ttc is Microsoft YaHei.
    wordcloud = WordCloud(font_path='msyh.ttc', mask=mask).generate(cut_comments())
    wordcloud.to_file('picture.png')


if __name__ == '__main__':
    # The original entry point called only create_word_cloud(); scraping first
    # ensures jd_comments.txt exists before it is read.
    batch_jd_comments()
    create_word_cloud()
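The script strips the JSONP wrapper with the fixed slice `response.text[20:-2]`, which only works while the callback name stays exactly `fetchJSON_comment98`. As a minimal sketch of a less brittle alternative, the wrapper could be matched with a regular expression instead. The helper names `parse_jsonp` and `extract_comment_texts` and the sample string are hypothetical and are not part of the original script; the sample only imitates the shape of the response the script expects.

```python
import json
import re

# Matches a JSONP wrapper such as  callbackName({...});  and captures the payload.
_JSONP_RE = re.compile(r'^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$', re.DOTALL)


def parse_jsonp(text):
    """Return the JSON payload of a JSONP response as a Python object.

    Raises ValueError if the text is not wrapped in a callback call
    or if the payload is not valid JSON.
    """
    match = _JSONP_RE.match(text)
    if match is None:
        raise ValueError('response is not in JSONP form')
    return json.loads(match.group(1))


def extract_comment_texts(payload):
    """Collect the 'content' field of each entry under 'comments'."""
    return [c['content'] for c in payload.get('comments', []) if 'content' in c]


# Hypothetical sample, shaped like the response the script above expects.
sample = 'fetchJSON_comment98({"comments": [{"content": "good"}, {"content": "fast delivery"}]});'
texts = extract_comment_texts(parse_jsonp(sample))
print(texts)
```

Under these assumptions, `get_jd_comments` could call `parse_jsonp(response.text)` in place of the slice followed by `json.loads`. A response that is not JSONP (for example an error page or a login redirect) would then raise a `ValueError` that the caller can catch, rather than failing on a mis-sliced string.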

That concludes this walkthrough of how to write Python code to crawl a website. Thanks for reading. To learn more, follow the 亿速云 industry news channel.
