Python Crawler Study Notes

Posted by Liao on 2019-09-22

Installing the requests library

pip install requests

import requests

r = requests.get("https://www.baidu.com")
r.status_code        # HTTP status code of the response
r.encoding = "utf-8" # set the response encoding (encoding is an attribute, not a method)
r.text               # the page content as a string
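In practice it is worth wrapping the request in the usual error handling. A minimal sketch of such a fetch helper (the function name and URL are only illustrative):

import requests

def get_html(url):
    """Fetch a page and return its text, or a short message on failure."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()              # raise an exception on 4xx/5xx responses
        r.encoding = r.apparent_encoding  # guess the encoding from the page content
        return r.text
    except requests.RequestException:
        return "fetch failed"

print(get_html("https://www.baidu.com")[:200])  # first 200 characters of the page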

Submitting search keywords to Baidu/360

import requests

kv = {'wd': 'Python'}  # search for the keyword 'Python'
r = requests.get("https://www.baidu.com", params=kv)
r.status_code

r.request.url  # the URL actually sent to Baidu
# returns: 'https://www.baidu.com/?wd=Python'
len(r.text)
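The heading also mentions 360 search. Assuming its endpoint is https://www.so.com/s with the keyword under the parameter q (an assumption, not verified here), the same params trick would look like:

import requests

kv = {'q': 'Python'}  # 360 search keyword parameter (assumed to be 'q')
r = requests.get("https://www.so.com/s", params=kv)
print(r.request.url)  # e.g. 'https://www.so.com/s?q=Python'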


Installing the BeautifulSoup library

pip install beautifulsoup4

import requests
from bs4 import BeautifulSoup

r = requests.get("https://2.taobao.com/")
demo = r.text

soup = BeautifulSoup(demo, 'html.parser')  # use Python's built-in HTML parser
print(soup.prettify())                     # print the prettified HTML source
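Once parsed, the soup object exposes the tag tree directly. A quick self-contained probe (the HTML snippet here is made up for illustration):

from bs4 import BeautifulSoup

demo = '<html><head><title>Demo</title></head><body><a href="http://example.com" id="link1">Example</a></body></html>'
soup = BeautifulSoup(demo, 'html.parser')

print(soup.title)     # <title>Demo</title>
print(soup.a)         # the first <a> tag in the document
print(soup.a.attrs)   # {'href': 'http://example.com', 'id': 'link1'}
print(soup.a.string)  # Example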


BeautifulSoup lookup functions

find_all('a') returns all <a> tags in the HTML

soup.select('.class') selects tags by class name

soup.select('#id') selects the tag with a given id

soup.select('.class a') selects <a> tags nested under a given class; nesting levels are separated by spaces (see the sketch below)
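A quick sketch of these lookups on a small inline snippet (the HTML is invented for illustration):

from bs4 import BeautifulSoup

html = """
<div class="story">
  <a href="http://example.com/elsie" id="link1">Elsie</a>
  <a href="http://example.com/lacie" id="link2">Lacie</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

print(soup.find_all('a'))       # every <a> tag
print(soup.select('.story'))    # tags with class="story"
print(soup.select('#link1'))    # the tag with id="link1"
print(soup.select('.story a'))  # <a> tags nested under class="story"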

Regular-expression matching

Compile the string form of a regular expression into a regex object:

regex = re.compile(pattern, flags)
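For example, compiling once and reusing the object (pattern and flag chosen only for illustration):

import re

regex = re.compile(r'lacie', re.IGNORECASE)      # raw-string pattern, case-insensitive flag
print(regex.search('http://example.com/Lacie'))  # matches 'Lacie' despite the capital L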

Find all <a> tags whose href attribute contains the keyword 'lacie':

>>> for x in soup.find_all('a', href=re.compile('lacie')):
...     print(x)
...
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

The HTML tag tree

Querying nested tags:

import re
import requests
from bs4 import BeautifulSoup

data = requests.get(url)  # url holds the address of the page to crawl
data.encoding = 'utf-8'
soup = BeautifulSoup(data.text, "lxml")  # the lxml parser must be installed
print(soup)
title = soup.find_all("ul")  # every <ul> list on the page
for i in title:
    for j in i.find_all('li'):
        for l in j.find_all('a', href=re.compile('doc-iicezuev')):
            print(l)