    Python, Java, Life, etc

    A blog of technology and life.

    Content syndication for the Web

    Level: Introductory


    Mike Olson (mike.olson@fourthought.com), Principal Consultant, Fourthought, Inc.
    Uche Ogbuji (uche.ogbuji@fourthought.com), Principal Consultant, Fourthought, Inc.

    13 Nov 2002

    RSS is one of the most successful XML services ever. Despite its chaotic roots, it has become the community standard for exchanging content information across Web sites. Python is an excellent tool for RSS processing, and Mike Olson and Uche Ogbuji introduce a couple of modules available for this purpose.

    RSS is an abbreviation with several expansions: "RDF Site Summary," "Really Simple Syndication," "Rich Site Summary," and perhaps others. Behind this confusion of names is an astonishing amount of politics for such a mundane technological area. RSS is a simple XML format for distributing summaries of content on Web sites. It can be used to share all sorts of information including, but not limited to, news flashes, Web site updates, event calendars, software updates, featured content collections, and items on Web-based auctions.

    RSS was created by Netscape in 1999 to allow content to be gathered from many sources into the Netcenter portal (which is now defunct). The UserLand community of Web enthusiasts became early supporters of RSS, and it soon became a very popular format. The popularity led to strains over how to improve RSS to make it even more broadly useful. This strain led to a fork in RSS development. One group chose an approach based on RDF, in order to take advantage of the great number of RDF tools and modules, and another chose a more stripped-down approach. The former is called RSS 1.0, and the latter RSS 0.91. Just last month the battle flared up again with a new version of the non-RDF variant of RSS, which its creators are calling "RSS 2.0."

    RSS 0.91 and 1.0 are very popular, and are used in numerous portals and Web logs. In fact, the blogging community is a great user of RSS, and RSS lies behind some of the most impressive networks of XML exchange in existence. These networks have grown organically, and are really the most successful networks of XML services anywhere. RSS is an XML service by virtue of being an exchange of XML information over an Internet protocol (the vast majority of RSS exchange is simple HTTP GET of RSS documents). In this article, we introduce just a few of the many Python tools available for working with RSS. We don't provide a technical introduction to RSS, because you can find this in many other articles (see Resources). We recommend first that you gain a basic familiarity with RSS, and that you understand XML. Understanding RDF is not required.

    [We consider RSS an 'XML service' rather than a 'Web service' due to the use of XML descriptions but the lack of use of WSDL. -- Editors]

    RSS.py
    Mark Nottingham's RSS.py is a Python library for RSS processing. It is very complete and well-written. It requires Python 2.2 and PyXML 0.7.1. Installation is easy; just download the Python file from Mark's home page and copy it to somewhere in your PYTHONPATH.

    Most users of RSS.py need only concern themselves with two classes it provides: CollectionChannel and TrackingChannel. The latter seems the more useful of the two. TrackingChannel is a data structure that contains all the RSS data indexed by the key of each item. CollectionChannel is a similar data structure, but organized more as RSS documents themselves are, with the top-level channel information pointing to the item details using hash values for the URLs. You will probably use the utility namespace declarations in the RSS.ns structure. Listing 1 is a simple script that downloads and parses an RSS feed for Python news, and prints out all the information from the various items in a simple listing.



    from RSS import ns, CollectionChannel, TrackingChannel

    #Create a tracking channel, which is a data structure that
    #Indexes RSS data by item URL
    tc = TrackingChannel()

    #Returns the RSSParser instance used, which can usually be ignored
    tc.parse("http://www.python.org/channews.rdf")

    RSS10_TITLE = (ns.rss10, 'title')
    RSS10_DESC = (ns.rss10, 'description')

    #You can also use tc.keys()
    items = tc.listItems()
    for item in items:
        #Each item is a (url, order_index) tuple
        url = item[0]
        print "RSS Item:", url
        #Get all the data for the item as a Python dictionary
        item_data = tc.getItem(item)
        print "Title:", item_data.get(RSS10_TITLE, "(none)")
        print "Description:", item_data.get(RSS10_DESC, "(none)")



    We start by creating a TrackingChannel instance, and then populate it with data parsed from the RSS feed at http://www.python.org/channews.rdf. RSS.py uses tuples as the property names for RSS data. This may seem an unusual approach to those not used to XML processing techniques, but it is actually a very useful way of being very precise about what was in the original RSS file. In effect, an RSS 0.91 title element is not considered to be equivalent to an RSS 1.0 one. There is enough data for the application to ignore this distinction, if it likes, by ignoring the namespace portion of each tuple; but the basic API is wedded to the syntax of the original RSS file, so that this information is not lost. In the code, we use this property data to gather all the items from the news feed for display. Notice that we are careful not to assume which properties any particular item might have. We retrieve properties using the safe form as seen in the code below.



    print "Title:", item_data.get(RSS10_TITLE, "(none)")

    This form provides a default value if the property is not found, unlike the following form, which fails if the property is missing.



    print "Title:", item_data[RSS10_TITLE]

    This precaution is necessary because you never know which elements are used in an RSS feed. Listing 2 shows the output from Listing 1.



    $ python listing1.py
    RSS Item: http://www.python.org/2.2.2/
    Title: Python 2.2.2b1
    Description: (none)
    RSS Item: http://sf.net/projects/spambayes/
    Title: spambayes project
    Description: (none)
    RSS Item: http://www.mems-exchange.org/software/scgi/
    Title: scgi 0.5
    Description: (none)
    RSS Item: http://roundup.sourceforge.net/
    Title: Roundup 0.4.4
    Description: (none)
    RSS Item: http://www.pygame.org/
    Title: Pygame 1.5.3
    Description: (none)
    RSS Item: http://www.cosc.canterbury.ac.nz/~greg/python/Pyrex/
    Title: Pyrex 0.4.4.1
    Description: (none)
    RSS Item: http://www.tundraware.com/Software/hb/
    Title: hb 1.88
    Description: (none)
    RSS Item: http://www.tundraware.com/Software/abck/
    Title: abck 2.2
    Description: (none)
    RSS Item: http://www.terra.es/personal7/inigoserna/lfm/
    Title: lfm 0.9
    Description: (none)
    RSS Item: http://www.tundraware.com/Software/waccess/
    Title: waccess 2.0
    Description: (none)
    RSS Item: http://www.krause-software.de/jinsitu/
    Title: JinSitu 0.3
    Description: (none)
    RSS Item: http://www.alobbs.com/pykyra/
    Title: PyKyra 0.1.0
    Description: (none)
    RSS Item: http://www.havenrock.com/developer/treewidgets/index.html
    Title: TreeWidgets 1.0a1
    Description: (none)
    RSS Item: http://civil.sf.net/
    Title: Civil 0.80
    Description: (none)
    RSS Item: http://www.stackless.com/
    Title: Stackless Python Beta
    Description: (none)

    Of course, you should expect somewhat different output, because the news items will have changed by the time you try it. The RSS.py channel objects also provide methods for adding and modifying RSS information. You can write the result back to RSS 1.0 format using the output() method. Try this out by writing back out the information parsed in Listing 1. Kick off the script in interactive mode by running: python -i listing1.py. At the resulting Python prompt, run the following example.



    >>> result = tc.output(items)
    >>> print result

    The result is an RSS 1.0 document printed out. You must have RSS.py version 0.42 or more recent for this to work, because there is a bug in the output() method in earlier versions.
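
    As noted earlier, RSS.py keys each property by a (namespace, local name) tuple, and an application can ignore the namespace portion if it prefers. The following is a minimal sketch of that idea in plain Python, with no dependency on RSS.py. The helper get_by_local_name and the sample_item dictionary are hypothetical names invented for illustration; only the two namespace URIs (RSS 1.0 and Dublin Core) are the published ones.

```python
# Published namespace URIs for RSS 1.0 and Dublin Core
RSS10_NS = "http://purl.org/rss/1.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def get_by_local_name(item_data, local_name, default="(none)"):
    """Return the value of the first property whose local name matches,
    regardless of namespace. Hypothetical helper, not part of RSS.py."""
    # Sort the keys so the result is deterministic if several namespaces match
    for key in sorted(item_data, key=repr):
        namespace, name = key
        if name == local_name:
            return item_data[key]
    return default


# Hypothetical item data, shaped like the dictionaries used in Listing 1
sample_item = {
    (RSS10_NS, "link"): "http://www.python.org/2.2.2/",
    (DC_NS, "title"): "Python 2.2.2b1",
}

if __name__ == "__main__":
    print(get_by_local_name(sample_item, "title"))
    print(get_by_local_name(sample_item, "description"))
```

    A lookup with the exact tuple (RSS10_NS, "title") would miss the title in this sample, because it is stored under the Dublin Core namespace. Matching on the local name alone finds it, at the cost of discarding the precision the tuples are meant to preserve.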

    rssparser.py
    Mark Pilgrim offers another module for RSS file parsing. It doesn't provide all the features and options that RSS.py does, but it does offer a very liberal parser, which deals well with all the confusing diversity in the world of RSS. To quote from the rssparser.py page:

    You see, most RSS feeds suck. Invalid characters, unescaped ampersands (Blogger feeds), invalid entities (Radio feeds), unescaped and invalid HTML (The Register's feed most days). Or just a bastardized mix of RSS 0.9x elements with RSS 1.0 elements (Movable Type feeds).
    Then there are feeds, like Aaron's feed, which are too bleeding edge. He puts an excerpt in the description element but puts the full text in the content:encoded element (as CDATA). This is valid RSS 1.0, but nobody actually uses it (except Aaron), few news aggregators support it, and many parsers choke on it. Other parsers are confused by the new elements (guid) in RSS 0.94 (see Dave Winer's feed for an example). And then there's Jon Udell's feed, with the fullitem element that he just sort of made up.

    It's funny to consider this in the light of the fact that XML and Web services are supposed to increase interoperability. Anyway, rssparser.py is designed to deal with all the madness.

    Installing rssparser.py is also very easy. You download the Python file (see Resources), rename it from "rssparser.py.txt" to "rssparser.py", and copy it to a directory in your PYTHONPATH. We also suggest getting the optional timeoutsocket module, which improves the timeout behavior of socket operations in Python and thus makes fetching RSS feeds less likely to stall the application thread in case of error.

    Listing 3 is a script that is the equivalent of Listing 1, but using rssparser.py, rather than RSS.py.



    import rssparser
    #Parse the data, returns a tuple: (data for channels, data for items)
    channel, items = rssparser.parse("http://www.python.org/channews.rdf")

    for item in items:
        #Each item is a dictionary mapping properties to values
        print "RSS Item:", item.get('link', "(none)")
        print "Title:", item.get('title', "(none)")
        print "Description:", item.get('description', "(none)")



    As you can see, the code is much simpler. The trade-off between RSS.py and rssparser.py is largely that the former has more features, and maintains more syntactic information from the RSS feed. The latter is simpler, and a more forgiving parser (the RSS.py parser only accepts well-formed XML).

    The output should be the same as in Listing 2.
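
    To see why a forgiving parser matters, consider the unescaped ampersands mentioned in the quote above. The sketch below uses the standard library's xml.dom.minidom as a stand-in for any strict XML parser; it does not call RSS.py or rssparser.py. The function is_well_formed and the two feed fragments are hypothetical, made up for illustration.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def is_well_formed(xml_text):
    """Return True if a strict XML parser accepts the text, False otherwise."""
    try:
        minidom.parseString(xml_text)
    except ExpatError:
        # The strict parser gives up on the first well-formedness error
        return False
    return True


# Hypothetical feed fragments: one escapes the ampersand, the other does not
GOOD_FEED = "<rss><channel><title>News &amp; notes</title></channel></rss>"
BAD_FEED = "<rss><channel><title>News & notes</title></channel></rss>"

if __name__ == "__main__":
    for label, feed in (("escaped", GOOD_FEED), ("unescaped", BAD_FEED)):
        print("%s ampersand is well-formed: %s" % (label, is_well_formed(feed)))
```

    A strict parser rejects the second fragment outright, so an application built on one gets no items at all from such a feed. That is the situation a liberal parser is designed to avoid.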

    Conclusion
    There are many Python tools for RSS, and we don't have space to cover them all. Aaron Swartz's page of RSS tools is a good place to start looking if you want to explore other modules out there. RSS is easy to work with in Python, because of all the great modules available for it. The modules hide all the chaos brought about by the history and popularity of RSS. If your XML service needs mostly involve the exchange of descriptive information for Web sites, we highly recommend RSS, the most successful XML service technology in use.

    Next month, we will explain how to use e-mail packages for Python for writing Web services over SMTP.

    Resources

    About the authors
    Mike Olson is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management applications. Fourthought develops 4Suite, an open source platform for XML middleware. You can contact Mr. Olson at mike.olson@fourthought.com.


    Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management applications. Fourthought develops 4Suite, an open source platform for XML middleware. Mr. Ogbuji is a Computer Engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can contact Mr. Ogbuji at uche.ogbuji@fourthought.com.

    Posted on 2005-02-17 02:48 by pyguru. Category: Build Website