
A Simple Python Script for Scraping Listings from the 我爱我家 Website

Date: 2021-11-07 15:36:32

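Housing prices in Hangzhou have been climbing sharply lately. Whether to follow the crowd and buy is a real headache, but a little preparation never hurts. A while ago I talked to an agent from 我爱我家 to get a sense of the market. He mentioned that the listings on the 我爱我家 official site are timely and accurate and worth checking regularly. True to a programmer's nature, anything a script can do for me is something I should not do by hand, so I wrote a script that monitors the site for changes and sends me a text message whenever a new listing appears.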

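First, feasibility. Scraping the site and fetching the HTML is of course not hard. What remains is to extract the useful information from the page and then text it to myself.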

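I searched for SMS services and found several, but most of them sell bundled packages, and even the cheapest costs a few hundred yuan, which is a bit much for a casual user like me. After comparing options I settled on 阿里大于 (Alidayu) as the SMS platform. It charges per message, on demand, at 0.045 yuan each, and it uses Taobao account login and Alipay payment, which is convenient.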

Talk is cheap, show me the code. Here it is, using the 之江一号 community as the example. To search a different community, just change the last field of url.

The wiwj.py file

    # -*- coding: utf-8 -*-
    # Python 2 script: scrape listings and text new ones to a phone number.

    import urllib2
    import re
    import sms
    import logging
    logging.basicConfig(level=logging.DEBUG,
                        filename='log.log'
                        )


    def del_html_mark(content):
        # Strip HTML tags.
        inner_pattern = re.compile('<.*?>')
        return re.sub(inner_pattern, '', content)


    def del_additional_mark(content):
        # Strip &nbsp entities, semicolons, and whitespace.
        inner_pattern = re.compile('&nbsp')
        content = re.sub(inner_pattern, '', content)
        inner_pattern = re.compile('[;\t\n ]')
        content = re.sub(inner_pattern, '', content)
        return content


    def handle_content(content):
        content = del_html_mark(content)
        content = del_additional_mark(content)
        return content


    # NOTE: the scheme and host are missing from this URL as published;
    # restore them before running.
    url = '/exchange/_%E4%B9%8B%E6%B1%9F%E4%B8%80%E5%8F%B7'
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent': user_agent}
    html_content = ''
    logging.info('Starting scanning...')
    try:
        request = urllib2.Request(url, headers=headers)
        response = urllib2.urlopen(request)
        html_content = response.read()
        # print html_content
    except urllib2.URLError, e:
        if hasattr(e, "code"):
            print e.code
        if hasattr(e, "reason"):
            print e.reason

    content = html_content.decode('utf-8')
    pattern = re.compile('<div class="list-info">(.*?)</div>', re.S)
    items = re.findall(pattern, content)
    for item in items:
        title_pattern = re.compile('<h2><a href=.*?>(.*?)</a></h2>')
        title_list = re.findall(title_pattern, item)
        title = title_list[0]
        title = handle_content(title)
        # print title
        community_pattern = re.compile('<a href=.*? target="_blank"><h3 >(.*?)</h3></a>')
        community_list = re.findall(community_pattern, item)
        community = community_list[0]
        community = handle_content(community)
        # print community
        detail_pattern = re.compile('<li class="font-balck">(.*?)</li>')
        detail_list = re.findall(detail_pattern, item)
        detail = detail_list[0]
        detail = handle_content(detail)
        # print detail
        price_pattern = re.compile('<h3>(.*?)</em></h3>')
        price_list = re.findall(price_pattern, item)
        price = price_list[0]
        price = handle_content(price)
        # print price
        size_pattern = re.compile('<p>(.*?)/.*?</p>')
        size_list = re.findall(size_pattern, item)
        size = size_list[0]
        size = handle_content(size)
        total_msg = {
            'title': title.encode('utf-8'),
            'community': community.encode('utf-8'),
            'detail': detail.encode('utf-8'),
            'price': price.encode('utf-8'),
            'size': size.encode('utf-8')
        }
        # Skip listings whose title is already in record.log
        # (the file must exist before the first run).
        record_file = open('record.log', 'r')
        old_info_list = record_file.readlines()
        record_file.close()
        is_old_message = False
        for old_info in old_info_list:
            if old_info == (total_msg['title'] + '\n'):
                is_old_message = True
                break
        if is_old_message:
            continue
        # New listing: send the text, then record and log it.
        record_file = open('record.log', 'a')
        print total_msg
        sms.send_msg(total_msg, 'my phone number')  # placeholder
        record_file.write(total_msg['title'] + '\n')
        record_file.close()
        logging.info(total_msg['title'])
        logging.info(total_msg['detail'])
        logging.info(total_msg['community'])
        logging.info(total_msg['size'])
        logging.info(total_msg['price'])
    logging.info('Finished scanning.')
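The last field of url in wiwj.py is simply the community name, percent-encoded as UTF-8. As a small sketch (written for Python 3 rather than the script's Python 2; the helper name community_path is made up for illustration, and the path prefix is copied from the listing, which does not show the scheme or host), the field for another community could be generated like this:

```python
from urllib.parse import quote, unquote

# Path prefix copied from the wiwj.py listing; the scheme and host are not
# shown there, so they are deliberately left out here as well.
PATH_PREFIX = "/exchange/_"


def community_path(name):
    """Return the search path for a community name (hypothetical helper).

    quote() encodes the name as UTF-8 and percent-escapes every byte that is
    not an unreserved ASCII character.
    """
    return PATH_PREFIX + quote(name, safe="")


path = community_path("之江一号")
print(path)  # /exchange/_%E4%B9%8B%E6%B1%9F%E4%B8%80%E5%8F%B7

# Decoding the last field gives back the community name.
print(unquote(path[len(PATH_PREFIX):]))  # 之江一号
```

Swapping in a different community name yields the string to paste into url.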

The sms.py file

    # -*- coding: utf-8 -*-

    import top.api


    def send_msg(msg, tel):
        # msg is a dict with 'community', 'size', and 'price' keys.
        app_key = "******"
        secret_key = "******"

        req = top.api.AlibabaAliqinFcSmsNumSendRequest()
        req.set_app_info(top.appinfo(app_key, secret_key))

        req.extend = ""
        req.sms_type = "normal"
        req.sms_free_sign_name = "******"
        req.sms_param = "{'position':'" + msg['community'] + "','temperature':'" + msg['size'] + \
                        "','detail':'" + msg['price'] + "'}"
        req.rec_num = tel
        req.sms_template_code = "******"
        try:
            resp = req.getResponse()
            print (resp)
        except Exception, e:
            print (e)


    if __name__ == "__main__":
        # send_msg indexes msg by key, so the test message must be a dict.
        test_msg = {'community': 'test community',
                    'size': 'test size',
                    'price': 'test price'}
        send_msg(test_msg, 'test phone number')

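wiwj.py is the main file. The basic idea is to disguise the Python script as a browser, fetch a specific page, and then use regular expressions to pull the needed fields out of the returned HTML, including the total price, area, unit price, and description.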
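To make the two steps concrete, here is a minimal sketch in Python 3 (the script itself is Python 2). It reuses the User-Agent string and three of the regular expressions from wiwj.py; the function names, the example.com address, and the sample HTML are made up for illustration and are not the real page markup.

```python
import re
import urllib.request

USER_AGENT = "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"


def build_request(url):
    """Build a request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


TAG = re.compile(r"<.*?>", re.S)      # any HTML tag
NOISE = re.compile(r"&nbsp|[;\t\n ]")  # entities, semicolons, whitespace


def clean(fragment):
    """Strip tags first, then entities and whitespace, as wiwj.py does."""
    return NOISE.sub("", TAG.sub("", fragment))


ITEM = re.compile(r'<div class="list-info">(.*?)</div>', re.S)
FIELDS = {
    "title": re.compile(r"<h2><a href=.*?>(.*?)</a></h2>", re.S),
    "price": re.compile(r"<h3>(.*?)</em></h3>", re.S),
    "size": re.compile(r"<p>(.*?)/.*?</p>", re.S),
}


def parse_listings(html):
    """Return one dict per list-info block; a missing field becomes ''."""
    listings = []
    for item in ITEM.findall(html):
        record = {}
        for name, pattern in FIELDS.items():
            match = pattern.search(item)
            record[name] = clean(match.group(1)) if match else ""
        listings.append(record)
    return listings


# Invented sample markup, shaped only to match the patterns above.
SAMPLE = """
<div class="list-info">
  <h2><a href="/x/1">Riverside-3BR</a></h2>
  <h3><em>350</em></h3>
  <p>89 m2 / 3 rooms</p>
</div>
"""

request = build_request("http://example.com/")  # placeholder address
print(request.get_header("User-agent"))
print(parse_listings(SAMPLE))
```

Unlike the original loop, this version does not raise an error when a pattern fails to match; it records an empty string instead, which is a design choice of the sketch rather than something the script does.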

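sms.py uses the 阿里大于 SDK, top.api. The two statements below create the request object; after that, a few parameters are set according to the template that was applied for.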

    req = top.api.AlibabaAliqinFcSmsNumSendRequest()
    req.set_app_info(top.appinfo(app_key, secret_key))

In sms.py I have replaced the details of my own template with ******. Readers who need one can apply for their own template.

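Listings on the 我爱我家 site change frequently. Even an old listing may suddenly be refreshed and appear at the top, and the site gives no detailed location for a property, so there is no direct way to tell whether a freshly updated listing was already there before. My approach is to maintain a log file that records every listing scraped so far. Each newly scraped listing is compared against the entries in that file: if the total price, unit price, and description are all identical, I treat it as the same property and send no text. If they are not all identical, it is a new property and a text is sent. (In wiwj.py as listed, the value actually written to record.log and compared is the listing title.)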
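The comparison described above can be sketched as follows, in Python 3. This is not the code from wiwj.py: the function names and the unit_price key are assumptions made for illustration, and the sketch assumes the field values contain no newlines (the script strips them).

```python
import os
import tempfile


def listing_key(listing):
    """Join the compared fields: total price, unit price, description."""
    return "|".join(listing[k] for k in ("price", "unit_price", "detail"))


def load_seen(path):
    """Read previously recorded keys; a missing file means nothing seen yet."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return {line.rstrip("\n") for line in f}


def filter_new(listings, path):
    """Return only unseen listings and append their keys to the record file."""
    seen = load_seen(path)
    fresh = []
    with open(path, "a", encoding="utf-8") as f:
        for listing in listings:
            key = listing_key(listing)
            if key in seen:
                continue  # same price, unit price, and description: skip
            seen.add(key)
            f.write(key + "\n")
            fresh.append(listing)
    return fresh


# Demonstration against a throwaway record file.
record = os.path.join(tempfile.mkdtemp(), "record.log")
a = {"price": "350", "unit_price": "39000", "detail": "3BR-renovated"}
b = {"price": "420", "unit_price": "41000", "detail": "4BR"}

first = filter_new([a, b], record)   # both are new
second = filter_new([a, b], record)  # both were recorded on the first pass
print(len(first), len(second))       # 2 0
```

Loading the record file into a set once per run avoids rereading it for every listing, and treating a missing file as empty means the first run needs no manual setup.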

One more note: some readers may find this statement odd:

    req.sms_param = "{'position':'" + msg['community'] + "','temperature':'" + msg['size'] + \
                    "','detail':'" + msg['price'] + "'}"

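The reason is that 阿里大于 SMS templates must pass review. I submitted several templates containing real-estate wording and every one of them was rejected, so in the end I disguised the template as a weather-forecast message, and that passed. One nice thing about 阿里大于 is that it states the reason for each rejection, so you can revise accordingly and resubmit.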
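As a side note on that statement, building the parameter string by concatenation breaks if a field happens to contain a quote character. A hedged alternative in Python 3 is to let the json module do the quoting. This assumes the service accepts standard double-quoted JSON for sms_param, which the original single-quoted string does not confirm; the function name build_sms_param is made up for illustration.

```python
import json


def build_sms_param(msg):
    """Map listing fields onto the weather-style template variables."""
    return json.dumps(
        {
            "position": msg["community"],
            "temperature": msg["size"],
            "detail": msg["price"],
        },
        ensure_ascii=False,  # keep Chinese text readable in the payload
    )


sample = {"community": "之江一号", "size": "89", "price": "350"}
param = build_sms_param(sample)
print(param)

# The string round-trips cleanly, even if a value contains a quote.
tricky = build_sms_param({"community": "it's", "size": "89", "price": "350"})
print(json.loads(tricky)["position"])
```

The result would be assigned to req.sms_param in place of the concatenated string.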

Finally, I put the scripts on my personal VPS (a Raspberry Pi would also work) and set up a crontab job:

*/10 * * * * python ~/Documents/Fang/wiwj.py

It runs every 10 minutes, which is close enough to real time for my purposes.

And that's it, all done XD
