Install the requests library: pip install requests
```python
import requests
r = requests.get("https://www.baidu.com")
r.status_code
r.encoding = "utf-8"
r.text
```
Submitting search keywords to Baidu/360

```python
import requests
kv = {'wd': 'Python'}
r = requests.get("https://www.baidu.com", params=kv)
r.status_code
r.request.url
len(r.text)
```
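To see exactly how `params` is encoded into the URL without sending any network traffic, you can build the request locally with `requests.Request(...).prepare()` (a small sketch using the same URL and keyword dict as above):

```python
import requests

# Build (but do not send) the GET request, then inspect the encoded URL.
req = requests.Request('GET', 'https://www.baidu.com',
                       params={'wd': 'Python'}).prepare()
# The dict in `params` is URL-encoded and appended as the query string.
print(req.url)
```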
Install the BeautifulSoup library: pip install beautifulsoup4
```python
import requests
from bs4 import BeautifulSoup

r = requests.get("https://2.taobao.com/")
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')
print(soup.prettify())
```
BeautifulSoup query functions: find_all('a') returns every <a> tag in the HTML
soup.select('.class') selects tags with the given class
soup.select('#id') selects the tag with the given id
soup.select('.class a') selects <a> tags nested under the given class; the space is the descendant separator
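The three selector forms above can be tried on a small inline document (the class/id names and hrefs here are illustrative, not from the original page):

```python
from bs4 import BeautifulSoup

html = """
<div class="story">
  <a id="link1" href="/one">One</a>
  <a id="link2" href="/two">Two</a>
</div>
<a href="/outside">Outside</a>
"""
soup = BeautifulSoup(html, 'html.parser')
story = soup.select('.story')          # tags with class="story"
link1 = soup.select('#link1')          # tag with id="link1"
inner = soup.select('.story a')        # <a> tags inside that class only
print(len(story), link1[0]['href'], len(inner))
```

Note that `.story a` excludes the "Outside" link because the descendant selector only matches tags nested under the class.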
Regular-expression matching
re.compile turns a regular expression given as a string into a compiled pattern object
```python
regex = re.compile(pattern, flags)
```
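A compiled pattern can be reused across many searches, and `flags` adjusts the matching behaviour; for example (keyword chosen to match the lacie example below):

```python
import re

# Compile once, reuse many times; IGNORECASE makes the match case-insensitive.
regex = re.compile(r'lacie', re.IGNORECASE)
print(bool(regex.search('http://example.com/Lacie')))  # True
print(bool(regex.search('http://example.com/elsie')))  # False
```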
Find all <a> tags whose href attribute contains a given keyword (here 'lacie')
```python
>>> for x in soup.find_all('a', href=re.compile('lacie')):
...     print(x)
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
```
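A self-contained version of the same query, with an inline document in place of the parsed page (the two sample links are illustrative):

```python
import re
from bs4 import BeautifulSoup

html = ('<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
        '<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>')
soup = BeautifulSoup(html, 'html.parser')

# href=re.compile(...) keeps only <a> tags whose href matches the pattern.
matches = soup.find_all('a', href=re.compile('lacie'))
for a in matches:
    print(a['id'], a.string)
```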
The HTML tag tree
Querying nested tags

```python
import re
import requests
from bs4 import BeautifulSoup

data = requests.get(url)  # url: the page to fetch, defined elsewhere
data.encoding = 'utf-8'
soup = BeautifulSoup(data.text, "lxml")
print(soup)
title = soup.find_all("ul")
for i in title:
    for j in i.find_all('li'):
        for l in j.find_all('a', href=re.compile('doc-iicezuev')):
            print(l)
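The same nested `find_all` walk can be tested offline; the inline `<ul>` below is a made-up stand-in for the fetched page, and 'html.parser' is used so the sketch runs without lxml installed:

```python
import re
from bs4 import BeautifulSoup

# Illustrative HTML; only the first link's href contains the keyword.
html = """
<ul>
  <li><a href="/news/doc-iicezuev123.shtml">wanted</a></li>
  <li><a href="/news/other.shtml">other</a></li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')
found = []
for ul in soup.find_all('ul'):          # outer level: lists
    for li in ul.find_all('li'):        # middle level: list items
        for a in li.find_all('a', href=re.compile('doc-iicezuev')):
            found.append(a.string)      # keep only matching links
print(found)
```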