Python web scraping: how can this be solved?

Published: 2011-06-29 19:55:17 Source: www.iduyao.cn Editor: 星星草

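Python web scraping
On this Wiktionary page, I want to scrape the 40,000 words and their frequencies from the several pages listed under "Frequency lists as of 2006-04-16:" and write them to a text file with two columns:
column 1 is the frequency and column 2 is the word.

For example: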
56271872 the
33950064 and

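The page is at http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists#Project_Gutenberg
I have only just started learning Python. Thanks in advance to all the experts! Please add some comments to the code; I would really appreciate it.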

------Solution--------------------

Python code


import re
import urllib.request

# Fetch a page from the site; returns the HTML text, or None on failure
def gethtml(url):
    try:
        opener = urllib.request.build_opener()
        # Identify the script with a custom User-agent header
        opener.addheaders = [('User-agent', 'MyWikipediaInfoScraper/0.1')]
        return opener.open(url).read().decode('utf-8')
    except Exception as ex:
        print(ex)
        return None

# For testing: read a saved copy of the page from a local file
def readlocalfile():
    try:
        filename = './res/wikipage1.txt'
        with open(filename, 'r', encoding='utf-8') as fh:
            return fh.read()
    except Exception as ex:
        print(ex)
        return None

# Extract the data and write it to a file; returns the list of
# (frequency, word) pairs, or None on failure
def getdata(html):
    try:
        # The pattern expects entries of the form
        #   <a href="/wiki/the" title="the">the</a> = 56271872
        # and captures the title (the word) and the number (the frequency).
        # Only all-lowercase words are matched.
        r = re.compile(r'<a href="/wiki/[a-z]+"\s*title="([a-z]+)">[a-z]+</a>\s*=\s*([0-9]+)')
        wordlists = [(number, word) for word, number in r.findall(html)]

        # Write two columns per line: frequency first, then the word
        datafile = './res/wikidata.txt'
        with open(datafile, 'w', encoding='utf-8') as fh:
            for number, word in wordlists:
                fh.write('%s %s\n' % (number, word))
        return wordlists
    except Exception as ex:
        print(ex)
        return None


# Download one frequency-list page and extract its data
def getwikidata():
    try:
        url = "http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists/PG/2006/04/1-10000"
        html = gethtml(url)
        #html = readlocalfile()
        if html is None:
            return -2
        datas = getdata(html)
        if datas is None:
            return -2
        return 0
    except Exception as ex:
        print(ex)
        return -1

if __name__ == '__main__':
    print('begin...')
    ret = getwikidata()
    if ret == -1 or ret == -2:
        print('end. but have some errors')
    else:
        print('end. ok')
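
The question asks for all 40,000 words, while the code above requests only the 1-10000 page. Below is a rough sketch of how the remaining pages might be covered. Everything in it is an assumption rather than part of the original answer: the URLs for the other pages are extrapolated from the single URL above (10001-20000, 20001-30000, 30001-40000) and should be checked against the links on the Wiktionary page, and `page_urls`, `format_rows` and `save_all` are hypothetical helper names. `fetch` and `parse` stand in for a download function that returns the HTML text (or None on failure) and an extraction function that returns (frequency, word) pairs.

```python
# Assumed common prefix, taken from the one page URL used above.
BASE = "http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists/PG/2006/04/"

def page_urls(total=40000, per_page=10000):
    """Build one URL per block of ranks (assumed naming: 1-10000, 10001-20000, ...)."""
    return ["%s%d-%d" % (BASE, start, start + per_page - 1)
            for start in range(1, total + 1, per_page)]

def format_rows(rows):
    """Turn (frequency, word) pairs into 'frequency word' lines."""
    return "".join("%s %s\n" % (count, word) for count, word in rows)

def save_all(fetch, parse, path):
    """Fetch every page, collect all rows, and write them to one file.

    fetch(url) -> HTML text, or None on failure (hypothetical contract)
    parse(html) -> list of (frequency, word) pairs (hypothetical contract)
    Returns the number of rows written.
    """
    rows = []
    for url in page_urls():
        html = fetch(url)
        if html is None:
            # Skip pages that could not be downloaded
            continue
        rows.extend(parse(html))
    with open(path, "w") as fh:
        fh.write(format_rows(rows))
    return len(rows)
```

Passing the download and extraction steps in as arguments means the loop can be tried out on saved local copies of the pages before any real requests are made.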
