
Crawling Zhihu Data with Scrapy (Part 1)

0. Preface


I recently spent some time learning the Scrapy framework. To get a better feel for it, I used it to crawl Zhihu user profiles, questions, answers, articles, and so on, and stored the data in MongoDB. This post records the general process.

Project dependencies: scrapy, redis, pymongo

Runtime environment: Python 3.6
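
If these are not installed yet, a quick pip install covers them (scrapy is all this part needs; pymongo and redis only come into play in the later parts of the series):

pip install scrapy pymongo redis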


1. Creating the Project


In the directory where the project should live, create a new Scrapy project by running:

scrapy startproject zhihuSpider

This creates a project folder named zhihuSpider in the current directory. Then move into the project folder and run the following command to generate a spider (note that genspider expects a domain, not a full URL):

scrapy genspider zhihu www.zhihu.com

With the Scrapy project created, we can move on. The directory layout at this point is:

zhihuSpider
├── scrapy.cfg
└── zhihuSpider
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        ├── __init__.py
        └── zhihu.py


2. Designing the Data Storage Structure


Now we can start writing the crawler proper. First we need to be clear about what information we want to crawl, and that is settled in items.py. An Item is the container that holds the scraped data, so we declare in it every field we want to save, as shown below:

class UserItem(scrapy.Item):
    name = scrapy.Field()      # name
    description = scrapy.Field()  # personal description
    headline = scrapy.Field()  # one-line headline
    gender = scrapy.Field()  # gender
    follower_count = scrapy.Field()  # number of followers
    following_count = scrapy.Field()  # number of users followed
    educations = scrapy.Field()  # education history
    employments = scrapy.Field()  # employment history
    avatar_url = scrapy.Field()  # avatar
    business = scrapy.Field()  # industry
    locations = scrapy.Field()  # places of residence
    url_token = scrapy.Field()  # url slug used in API requests

class RelationItem(scrapy.Item):
    from_id = scrapy.Field()  # the user the relation originates from
    to_id = scrapy.Field()  # the users the relation points to
    relation_type = scrapy.Field()  # relation type (followee or follower)


class AnswerItem(scrapy.Item):
    answer_user_name = scrapy.Field()  # nickname of the user who wrote the answer
    question_name = scrapy.Field()  # question title
    content = scrapy.Field()  # answer content
    question_id = scrapy.Field()  # question id
    voteup_count = scrapy.Field()  # upvote count
    answer_id = scrapy.Field()  # id of this answer
    comment_count = scrapy.Field()  # comment count


class QuestionItem(scrapy.Item):
    name = scrapy.Field()  # question title
    author = scrapy.Field()  # question author
    id = scrapy.Field()  # question id
    answer_count = scrapy.Field()  # number of answers
    follower_count = scrapy.Field()  # number of followers
    created = scrapy.Field()  # creation time
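
The article callback in section 3 below also fills an ArticleItem, which is missing from this listing; a definition consistent with the fields used there would look like this:

class ArticleItem(scrapy.Item):
    title = scrapy.Field()  # article title
    author = scrapy.Field()  # article author
    id = scrapy.Field()  # article id
    content = scrapy.Field()  # article content
    voteup_count = scrapy.Field()  # upvote count
    comment_count = scrapy.Field()  # comment count
    created = scrapy.Field()  # creation time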

With these Items written, we can save the corresponding data in this form. An Item behaves like a dict, e.g. {"name": "Lily"}.
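
As a quick illustration (assuming the classes above sit in zhihuSpider/items.py, where Scrapy generated the file), an Item is filled and read just like a dict:

from zhihuSpider.items import UserItem

item = UserItem()
item['name'] = 'Lily'    # fields are assigned like dict keys
print(dict(item))        # {'name': 'Lily'}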

With the storage format settled, we can move on to the part of the crawler that actually fetches the data.


3. Writing the Spider


Open zhihu.py and you will see the generated boilerplate:

class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['http://www.zhihu.com/']

    def parse(self, response):
        pass

The idea is to use one user's profile as the entry point: crawl that user's questions, answers, articles, followees, followers, and personal info, then take the followees and followers as the next users to crawl, fanning out like a web until we have gathered what we want.

So in zhihu.py we first add a list of start_user_ids:

start_user_ids = ['qing-ci-53-14', 'qiao-yi-miao-73', 'changeclub']

These IDs can be read straight from a user's profile URL. If we then watch how the profile page loads in the browser's developer tools, we can see the API it calls, and we build our request URLs on that API. The advantage of going through the API is that there is no HTML to parse: the API already returns structured data, so we simply pick out the fields we need.
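
Since the same member-info URL is built again later when the crawl fans out to new users, it can be convenient to keep the long include template in one place. A small sketch (the constant and helper names here are my own, not part of the original code):

USER_API = ('https://www.zhihu.com/api/v4/members/{id}'
            '?include=locations,employments,industry_category,gender,educations,business,'
            'follower_count,following_count,description,badge[?(type=best_answerer)].topics')

def user_url(url_token):
    """Build the member-info API URL for a given url_token."""
    return USER_API.format(id=url_token)

With that, both start_requests below and the fan-out at the end of parse_follow could simply call user_url(one).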

We override the start_requests method of ZhihuSpider to build the crawl entry points, with parse() as the callback:

    def start_requests(self):
        for one in self.start_user_ids:
            yield scrapy.Request('https://www.zhihu.com/api/v4/members/' + one +'?include=locations,employments,industry_category,gender,educations,business,follower_count,following_count,description,badge[?(type=best_answerer)].topics', callback=self.parse)

Then, in parse(), we extract information from the response and store it in the Item we defined:

    def parse(self, response):
        result = json.loads(response.text)   # parse the JSON response body into a dict (add `import json` at the top of zhihu.py)

With the response content turned into a dict, we can read values directly by key. The next step is to write the corresponding values into our Item:

        item = UserItem()   # the Item we are going to fill
        item['name'] = result.get('name')
        item['url_token'] = result.get('url_token')
        item['description'] = result.get('description')
        item['avatar_url'] = result.get("avatar_url")
        item['follower_count'] = result.get('follower_count')
        item['following_count'] = result.get('following_count')
        item['headline'] = result.get('headline')
        if result.get('gender') == 1:
            item['gender'] = '男'
        elif result.get('gender') == 0:
            item['gender'] = '女'
        else:
            item['gender'] = '未知'
        item['locations'] = []
        for location in result.get('locations'):
            item['locations'].append(location['name'])
        item['educations'] = []
        # Education entries come in several shapes, so guard against missing keys
        for education in result.get('educations'):
            try:
                item['educations'].append(education['school']['name'] + ':' + education['major']['name'])
            except KeyError:
                try:
                    item['educations'].append(education['school']['name'])
                except KeyError:
                    continue
        # Likewise, employment entries vary, so guard against missing keys as well
        item['employments'] = []
        for employment in result.get('employments'):
            try:
                item['employments'].append(employment['company']['name'] + ':' + employment['job']['name'])
            except KeyError:
                try:
                    item['employments'].append(employment['company']['name'])
                except KeyError:
                    continue
        try:
            item['business'] = result['business']['name']
        except KeyError:
            item['business'] = ''
        yield item

That takes care of the user's profile information. Next we fan out from this user to crawl everything else. With the browser's developer tools open on a profile page, click through Articles, Answers, Questions, Following, and Followers to see the API each tab calls; those are the APIs we borrow to fetch the data. Keep adding the following code to parse():

        user_id = result.get('url_token')
        # Build the request for this user's answers; the callback is parse_answer, and meta passes along the current user's name
        yield scrapy.Request(
            'https://www.zhihu.com/api/v4/members/' + user_id + '/answers?include=data[*].is_normal,admin_closed_comment,reward_info,is_collapsed,annotation_action,annotation_detail,collapse_reason,collapsed_by,suggest_edit,comment_count,can_comment,content,voteup_count,reshipment_settings,comment_permission,mark_infos,created_time,updated_time,review_info,question,excerpt,relationship.is_authorized,voting,is_author,is_thanked,is_nothelp;data[*].author.badge[?(type=best_answerer)].topics',
            callback=self.parse_answer, meta={'name': item['name']})
        # Build the request for this user's questions; the callback is parse_question, and meta passes along the current user's name
        yield scrapy.Request(
            'https://www.zhihu.com/api/v4/members/' + user_id + '/questions?include=data[*].created,answer_count,follower_count,author,admin_closed_comment',
            callback=self.parse_question, meta={"author": item['name']})
        # Build the request for this user's articles; the callback is parse_article, and meta passes along the current user's name
        yield scrapy.Request(
            'https://www.zhihu.com/api/v4/members/' + user_id + '/articles?include=data[*].comment_count,suggest_edit,is_normal,thumbnail_extra_info,thumbnail,can_comment,comment_permission,admin_closed_comment,content,voteup_count,created,updated,upvoted_followees,voting,review_info;data[*].author.badge[?(type=best_answerer)].topics',
            callback=self.parse_article, meta={'author': item['name']})
        # Build the request for the users this user follows; the callback is parse_follow, and meta carries the relation type and this user's id
        yield scrapy.Request(
            'https://www.zhihu.com/api/v4/members/' + user_id + '/followees?include=data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics',
            callback=self.parse_follow, meta={'type': 'followee', 'id': item['url_token']})
        # Build the request for this user's followers; the callback is parse_follow, and meta carries the relation type and this user's id
        yield scrapy.Request(
            'https://www.zhihu.com/api/v4/members/' + user_id + '/followers?include=data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics',
            callback=self.parse_follow, meta={'type': 'follower', 'id': item['url_token']})

With that, parse() is complete, and we can write the corresponding callbacks to process the data and build new requests. The approach is the same as in parse(): turn the response body into a dict and extract from it directly.

    def parse_follow(self, response):
        result = json.loads(response.text)
        item = RelationItem()
        item['from_id'] = response.meta['id']
        item['relation_type'] = response.meta['type']
        item['to_id'] = []
        # Store every related user's url_token in to_id
        for data in result['data']:
            item['to_id'].append(data['url_token'])
        yield item
        # If there is a next page, build another request, again with parse_follow as the callback
        if not result['paging']['is_end']:
            yield scrapy.Request(result['paging']['next'], callback=self.parse_follow, meta={'type': item['relation_type'], 'id': item['from_id']})
        # Every element of item['to_id'] is a new user, so build a fresh profile request for it with parse() as the callback; this is what makes the crawl fan out into a tree
        for one in item['to_id']:
            yield scrapy.Request('https://www.zhihu.com/api/v4/members/' + one +'?include=locations,employments,industry_category,gender,educations,business,follower_count,following_count,description,badge[?(type=best_answerer)].topics', callback=self.parse)

    def parse_answer(self, response):
        results = json.loads(response.text)
        name = response.meta['name']
        for result in results['data']:
            item = AnswerItem()
            item['answer_user_name'] = name
            item['question_name'] = result['question']['title']
            item['content'] = result['content']
            item['question_id'] = result['question']['id']
            item['voteup_count'] = result['voteup_count']
            item['comment_count'] = result['comment_count']
            item['answer_id'] = result['id']
            yield item
        # If there is a next page, build another request, again with parse_answer as the callback
        if not results['paging']['is_end']:
            yield scrapy.Request(results['paging']['next'], callback=self.parse_answer, meta={'name': name})

    def parse_question(self, response):
        results = json.loads(response.text)
        author = response.meta['author']
        for result in results['data']:
            item = QuestionItem()
            item['name'] = result['title']
            item['author'] = author
            item['created'] = result['created']
            item['answer_count'] = result['answer_count']
            item['follower_count'] = result['follower_count']
            item['id'] = result['id']
            yield item
        # If there is a next page, build another request, again with parse_question as the callback
        if not results['paging']['is_end']:
            yield scrapy.Request(results['paging']['next'], callback=self.parse_question, meta={'author': author})

    def parse_article(self, response):
        results = json.loads(response.text)
        author = response.meta['author']
        for result in results['data']:
            item = ArticleItem()
            item['title'] = result['title']
            item['author'] = author
            item['created'] = result['created']
            item['id'] = result['id']
            item['voteup_count'] = result['voteup_count']
            item['comment_count'] = result['comment_count']
            item['content'] = result['content']
            yield item
        # If there is a next page, build another request, again with parse_article as the callback
        if not results['paging']['is_end']:
            yield scrapy.Request(results['paging']['next'], callback=self.parse_article, meta={'author': author})

That completes the Spider, and the project can now be run. The next part will cover storing the data in MongoDB, as well as middleware for request headers and proxies.
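
To try it out, launch the crawl from the project root with

scrapy crawl zhihu

Until the header and proxy middleware from the next part are in place, a couple of tweaks in settings.py will probably be needed as well (this is my own assumption, not part of the original walkthrough):

ROBOTSTXT_OBEY = False   # the API paths used above are disallowed by Zhihu's robots.txt
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'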


Copyright notice: this is an original post by the author; please credit the source when reposting. https://blog.thinker.ink/passage/17/

 
