
    Python, Java, Life, etc

    A blog of technology and life.

    Content syndication for the Web

    Level: Introductory


    Mike Olson (mike.olson@fourthought.com), Principal Consultant, Fourthought, Inc.
    Uche Ogbuji (uche.ogbuji@fourthought.com), Principal Consultant, Fourthought, Inc.

    13 Nov 2002

    RSS is one of the most successful XML services ever. Despite its chaotic roots, it has become the community standard for exchanging content information across Web sites. Python is an excellent tool for RSS processing, and Mike Olson and Uche Ogbuji introduce a couple of modules available for this purpose.

    RSS is an abbreviation with several expansions: "RDF Site Summary," "Really Simple Syndication," "Rich Site Summary," and perhaps others. Behind this confusion of names is an astonishing amount of politics for such a mundane technological area. RSS is a simple XML format for distributing summaries of content on Web sites. It can be used to share all sorts of information including, but not limited to, news flashes, Web site updates, event calendars, software updates, featured content collections, and items on Web-based auctions.

    RSS was created by Netscape in 1999 to allow content to be gathered from many sources into the Netcenter portal (which is now defunct). The UserLand community of Web enthusiasts became early supporters of RSS, and it soon became a very popular format. The popularity led to strains over how to improve RSS to make it even more broadly useful. This strain led to a fork in RSS development. One group chose an approach based on RDF, in order to take advantage of the great number of RDF tools and modules, and another chose a more stripped-down approach. The former is called RSS 1.0, and the latter RSS 0.91. Just last month the battle flared up again with a new version of the non-RDF variant of RSS, which its creators are calling "RSS 2.0."

    RSS 0.91 and 1.0 are very popular, and are used in numerous portals and Web logs. In fact, the blogging community is a great user of RSS, and RSS lies behind some of the most impressive networks of XML exchange in existence. These networks have grown organically, and are really the most successful networks of XML services anywhere. RSS is an XML service by virtue of being an exchange of XML information over an Internet protocol (the vast majority of RSS exchange is a simple HTTP GET of RSS documents). In this article, we introduce just a few of the many Python tools available for working with RSS. We don't provide a technical introduction to RSS, because you can find one in many other articles (see Resources). We recommend that you first gain a basic familiarity with RSS, and that you understand XML. Understanding RDF is not required.

    [We consider RSS an 'XML service' rather than a 'Web service' due to the use of XML descriptions but the lack of use of WSDL. -- Editors]

    RSS.py
    Mark Nottingham's RSS.py is a Python library for RSS processing. It is very complete and well-written. It requires Python 2.2 and PyXML 0.7.1. Installation is easy; just download the Python file from Mark's home page and copy it to somewhere in your PYTHONPATH.

    Most users of RSS.py need only concern themselves with two classes it provides: CollectionChannel and TrackingChannel. The latter seems the more useful of the two. TrackingChannel is a data structure that contains all the RSS data indexed by the key of each item. CollectionChannel is a similar data structure, but organized more as RSS documents themselves are, with the top-level channel information pointing to the item details using hash values for the URLs. You will probably use the utility namespace declarations in the RSS.ns structure. Listing 1 is a simple script that downloads and parses an RSS feed for Python news, and prints out all the information from the various items in a simple listing.



    from RSS import ns, CollectionChannel, TrackingChannel

    #Create a tracking channel, which is a data structure that
    #indexes RSS data by item URL
    tc = TrackingChannel()

    #Returns the RSSParser instance used, which can usually be ignored
    tc.parse("http://www.python.org/channews.rdf")

    RSS10_TITLE = (ns.rss10, 'title')
    RSS10_DESC = (ns.rss10, 'description')

    #You can also use tc.keys()
    items = tc.listItems()
    for item in items:
        #Each item is a (url, order_index) tuple
        url = item[0]
        print "RSS Item:", url
        #Get all the data for the item as a Python dictionary
        item_data = tc.getItem(item)
        print "Title:", item_data.get(RSS10_TITLE, "(none)")
        print "Description:", item_data.get(RSS10_DESC, "(none)")



    We start by creating a TrackingChannel instance, and then populate it with data parsed from the RSS feed at http://www.python.org/channews.rdf. RSS.py uses tuples as the property names for RSS data. This may seem an unusual approach to those not used to XML processing techniques, but it is actually a very useful way of being very precise about what was in the original RSS file. In effect, an RSS 0.91 title element is not considered to be equivalent to an RSS 1.0 one. There is enough data for the application to ignore this distinction, if it likes, by ignoring the namespace portion of each tuple; but the basic API is wedded to the syntax of the original RSS file, so that this information is not lost. In the code, we use this property data to gather all the items from the news feed for display. Notice that we are careful not to assume which properties any particular item might have. We retrieve properties using the safe form as seen in the code below.



    print "Title:", item_data.get(RSS10_TITLE, "(none)")

    This form provides a default value if the property is not found, unlike the following form, which raises a KeyError when the property is missing.



    print "Title:", item_data[RSS10_TITLE]

    This precaution is necessary because you never know which elements are used in an RSS feed. Listing 2 shows the output from Listing 1.



    $ python listing1.py
    RSS Item: http://www.python.org/2.2.2/
    Title: Python 2.2.2b1
    Description: (none)
    RSS Item: http://sf.net/projects/spambayes/
    Title: spambayes project
    Description: (none)
    RSS Item: http://www.mems-exchange.org/software/scgi/
    Title: scgi 0.5
    Description: (none)
    RSS Item: http://roundup.sourceforge.net/
    Title: Roundup 0.4.4
    Description: (none)
    RSS Item: http://www.pygame.org/
    Title: Pygame 1.5.3
    Description: (none)
    RSS Item: http://www.cosc.canterbury.ac.nz/~greg/python/Pyrex/
    Title: Pyrex 0.4.4.1
    Description: (none)
    RSS Item: http://www.tundraware.com/Software/hb/
    Title: hb 1.88
    Description: (none)
    RSS Item: http://www.tundraware.com/Software/abck/
    Title: abck 2.2
    Description: (none)
    RSS Item: http://www.terra.es/personal7/inigoserna/lfm/
    Title: lfm 0.9
    Description: (none)
    RSS Item: http://www.tundraware.com/Software/waccess/
    Title: waccess 2.0
    Description: (none)
    RSS Item: http://www.krause-software.de/jinsitu/
    Title: JinSitu 0.3
    Description: (none)
    RSS Item: http://www.alobbs.com/pykyra/
    Title: PyKyra 0.1.0
    Description: (none)
    RSS Item: http://www.havenrock.com/developer/treewidgets/index.html
    Title: TreeWidgets 1.0a1
    Description: (none)
    RSS Item: http://civil.sf.net/
    Title: Civil 0.80
    Description: (none)
    RSS Item: http://www.stackless.com/
    Title: Stackless Python Beta
    Description: (none)

    Of course, you should expect somewhat different output, because the news items will have changed by the time you try it. The RSS.py channel objects also provide methods for adding and modifying RSS information. You can write the result back to RSS 1.0 format using the output() method. Try this out by writing the information parsed in Listing 1 back out. Kick off the script in interactive mode by running python -i listing1.py. At the resulting Python prompt, run the following example.



    >>> result = tc.output(items)
    >>> print result

    The result is an RSS 1.0 document, printed out. You must have RSS.py version 0.42 or more recent for this to work, because there is a bug in the output() method in earlier versions.
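
As noted earlier, an application can ignore the distinction between an RSS 0.91 title and an RSS 1.0 title by ignoring the namespace portion of each property tuple. The following is a minimal sketch of that idea and is not part of RSS.py: it works on plain dictionaries with (namespace, local name) tuple keys, like the item data in Listing 1. The helper name get_by_local_name, the sample data, and the OTHER_NS placeholder are assumptions made for illustration, and the code uses print() calls so that it also runs on current Python versions.

```python
# Sketch: find an RSS property by local name only, ignoring the namespace
# half of (namespace, local_name) tuple keys such as those used in Listing 1.

RSS10_NS = "http://purl.org/rss/1.0/"  # the RSS 1.0 namespace URI
OTHER_NS = "urn:example:other-rss"     # hypothetical placeholder namespace


def get_by_local_name(item_data, local_name, default="(none)"):
    """Return the value of the first key whose local name matches."""
    for (namespace, name), value in item_data.items():
        if name == local_name:
            return value
    return default


# Invented sample data standing in for items from two different feeds.
item_a = {(RSS10_NS, "title"): "Python 2.2.2b1"}
item_b = {(OTHER_NS, "title"): "spambayes project"}

for item_data in (item_a, item_b):
    print("Title: " + get_by_local_name(item_data, "title"))
    print("Description: " + get_by_local_name(item_data, "description"))
```

The cost of this convenience is exactly the precision that the tuple keys provide: if an item carried a title in more than one namespace, the helper would simply return whichever one it met first.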

    rssparser.py
    Mark Pilgrim offers another module for RSS file parsing. It doesn't provide all the features and options that RSS.py does, but it does offer a very liberal parser, which deals well with all the confusing diversity in the world of RSS. To quote from the rssparser.py page:

    You see, most RSS feeds suck. Invalid characters, unescaped ampersands (Blogger feeds), invalid entities (Radio feeds), unescaped and invalid HTML (The Register's feed most days). Or just a bastardized mix of RSS 0.9x elements with RSS 1.0 elements (Movable Type feeds).
    Then there are feeds, like Aaron's feed, which are too bleeding edge. He puts an excerpt in the description element but puts the full text in the content:encoded element (as CDATA). This is valid RSS 1.0, but nobody actually uses it (except Aaron), few news aggregators support it, and many parsers choke on it. Other parsers are confused by the new elements (guid) in RSS 0.94 (see Dave Winer's feed for an example). And then there's Jon Udell's feed, with the fullitem element that he just sort of made up.

    It's funny to consider this in the light of the fact that XML and Web services are supposed to increase interoperability. Anyway, rssparser.py is designed to deal with all the madness.

    Installing rssparser.py is also very easy. You download the Python file (see Resources), rename it from "rssparser.py.txt" to "rssparser.py", and copy it to your PYTHONPATH. We also suggest getting the optional timeoutsocket module, which improves the timeout behavior of socket operations in Python, and thus makes a failed RSS download less likely to stall the application thread.
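
Python releases newer than the one this article targets (2.3 and later) include a default socket timeout in the standard socket module, which covers much of the same ground. The sketch below uses that standard facility rather than timeoutsocket itself. The helper name call_with_timeout is hypothetical, and the commented usage assumes the rssparser.parse call shown in Listing 3.

```python
import socket


def call_with_timeout(seconds, func, *args):
    """Run func(*args) with a temporary default timeout for new sockets."""
    previous = socket.getdefaulttimeout()
    socket.setdefaulttimeout(seconds)
    try:
        return func(*args)
    finally:
        # Restore the earlier setting so other code is unaffected.
        socket.setdefaulttimeout(previous)


# Hypothetical usage with the parser from Listing 3:
# channel, items = call_with_timeout(
#     10, rssparser.parse, "http://www.python.org/channews.rdf")
```

Restoring the previous value in the finally clause keeps the timeout local to the feed download, so a slow feed cannot change how the rest of the application treats its sockets.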

    Listing 3 is a script that is the equivalent of Listing 1, but using rssparser.py, rather than RSS.py.



    import rssparser
    #Parse the data, returns a tuple: (data for channels, data for items)
    channel, items = rssparser.parse("http://www.python.org/channews.rdf")

    for item in items:
        #Each item is a dictionary mapping properties to values
        print "RSS Item:", item.get('link', "(none)")
        print "Title:", item.get('title', "(none)")
        print "Description:", item.get('description', "(none)")



    As you can see, the code is much simpler. The trade-off between RSS.py and rssparser.py is largely that the former has more features, and maintains more syntactic information from the RSS feed. The latter is simpler, and a more forgiving parser (the RSS.py parser only accepts well-formed XML).

    The output should be the same as in Listing 2.
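
To make the well-formedness point concrete, here is a small sketch of what a strict XML parser does with the kind of unescaped ampersand mentioned in the quotation above. It uses the standard xml.etree.ElementTree module found in current Python releases, rather than either RSS.py or rssparser.py, and both feed strings and both function names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample feeds: the second contains an unescaped ampersand.
GOOD_FEED = ("<rss><channel><item><title>News &amp; notes</title>"
             "</item></channel></rss>")
BAD_FEED = ("<rss><channel><item><title>News & notes</title>"
            "</item></channel></rss>")


def is_well_formed(xml_text):
    """Return True if a strict XML parser accepts the text."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


def item_titles(xml_text):
    """Return the text of every title element; raises on malformed XML."""
    root = ET.fromstring(xml_text)
    return [title.text for title in root.iter("title")]


for name, feed in (("good", GOOD_FEED), ("bad", BAD_FEED)):
    if is_well_formed(feed):
        print(name + " feed titles: " + ", ".join(item_titles(feed)))
    else:
        print(name + " feed rejected: not well-formed XML")
```

A strict parser gives up on the second feed entirely, which is the situation a liberal parser is designed to survive.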

    Conclusion
    There are many Python tools for RSS, and we don't have space to cover them all. Aaron Swartz's page of RSS tools is a good place to start looking if you want to explore other modules. RSS is easy to work with in Python because of all the great modules available for it. The modules hide the chaos brought about by the history and popularity of RSS. If your XML service needs mostly involve the exchange of descriptive information for Web sites, we highly recommend using the most successful XML service technology in use.

    Next month, we will explain how to use e-mail packages for Python for writing Web services over SMTP.

    Resources

    About the authors
    Mike Olson is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management applications. Fourthought develops 4Suite, an open source platform for XML middleware. You can contact Mr. Olson at mike.olson@fourthought.com.


    Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management applications. Fourthought develops 4Suite, an open source platform for XML middleware. Mr. Ogbuji is a Computer Engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can contact Mr. Ogbuji at uche.ogbuji@fourthought.com.

    posted on 2005-02-17 02:48 by pyguru, Views (351), Comments (0), Category: Build Website