This section gives an overview of about 50 of the most useful library modules. These modules are proven technology, widely used, heavily tested and constantly improved. The time spent learning these modules will reduce the time it takes you to build an application that does useful work.
We'll dig more deeply into just a few of these modules in subsequent chapters.
As consultants, we've seen far too many programmers writing modules which overlap these library modules. There are two causes: ignorance and hubris. In this section, we hope to tackle the ignorance cause.
Python includes a large number of pre-built modules. The more you know about these, the less programming you have to do.
Hubris sometimes comes from the feeling that the library module doesn't fit our unique problem well enough to justify studying it. With many languages, you can't read the library module to see what it really does. In Python, the documentation is only an introduction; you're encouraged to actually read the library module.
We find that hubris is most closely associated with calendrical calculations. It isn't clear why programmers invest so much time and effort writing buggy calendrical calculations. Python provides many modules for dealing with times, dates and the calendar.
4. String Services. The String Services modules contain string-related functions or classes. See Chapter 12, Strings for more information on strings.
The re module is the core of text
pattern recognition and processing. A regular
expression is a formula that specifies how to
recognize and parse strings. The re module
is described in detail in Chapter 31, Complex Strings: the re Module.
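As a small taste of what a regular expression looks like, here is a minimal sketch. The pattern and the sample strings are our own illustration, not taken from a later chapter.

```python
import re

# Hypothetical pattern: recognize dates written as YYYY-MM-DD.
date_pattern = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

# A string that fits the formula is recognized and parsed into pieces.
match = date_pattern.match("2000-02-29")
year, month, day = match.groups()
print(year, month, day)

# A string that does not fit the formula produces no match object.
no_match = date_pattern.match("Feb 29, 2000")
print(no_match)
```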
The avowed purpose of the struct
module is to allow a Python program to access C-language APIs; it
packs and unpacks C-language struct objects. It turns out that this
module can also help you deal with files in packed binary
formats.
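The following sketch shows the packing and unpacking idea. The record layout (a two-byte id, a four-byte length and a four-character tag) is a made-up example, not a real file format.

```python
import struct

# Hypothetical record layout: little-endian, unsigned short,
# unsigned int, then a 4-byte string.
RECORD_FORMAT = "<HI4s"

# Pack three Python values into a compact run of bytes.
packed = struct.pack(RECORD_FORMAT, 7, 1024, b"DATA")

# Unpack the same bytes back into Python values.
record_id, length, tag = struct.unpack(RECORD_FORMAT, packed)
print(len(packed), record_id, length, tag)
```

The same format string can be applied to bytes read from a binary file, provided the layout really matches what was written.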
The difflib module contains the
essential algorithms for comparing two sequences, usually
sequences of lines of text. This has algorithms similar to those
used by the Unix diff command (the Windows
COMP command).
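A minimal sketch of comparing two sequences of lines follows. The line contents and the file labels are invented for the illustration.

```python
import difflib

old_lines = ["spam\n", "eggs\n", "ham\n"]
new_lines = ["spam\n", "bacon\n", "ham\n"]

# unified_diff yields the lines of a diff in the familiar "unified" layout.
# The fromfile and tofile labels are only used in the header lines.
diff = list(difflib.unified_diff(old_lines, new_lines,
                                 fromfile="old.txt", tofile="new.txt"))
print("".join(diff), end="")
```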
There are two variations on StringIO
which provide file-like objects that read from or write to a
string buffer. The StringIO module defines
the class StringIO, from which subclasses
can be derived. The cStringIO module
provides a high-speed C-language implementation that can't be
subclassed.
Note that these modules have atypical mixed-case names.
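Here is a rough sketch of a file-like string buffer. It assumes Python 3, where the class is spelled io.StringIO rather than living in the StringIO or cStringIO modules; the idea is the same.

```python
import io

# Write to the buffer exactly as if it were an open file.
buffer = io.StringIO()
buffer.write("first line\n")
print("second line", file=buffer)
text = buffer.getvalue()

# Read from a string exactly as if it were an open file.
reader = io.StringIO(text)
lines = reader.readlines()
print(lines)
```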
The textwrap module formats plain text. While the word-wrapping task is sometimes handled by word processors, you may need this in other kinds of programs. Plain text files are still the most portable, standard way to provide a document.
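Word-wrapping of this kind can be sketched with the textwrap module's fill function. The 30-column width is an arbitrary choice for the illustration.

```python
import textwrap

text = ("Plain text files are still the most portable, "
        "standard way to provide a document.")

# fill() returns a single string with newlines inserted so that
# no line is longer than the requested width.
wrapped = textwrap.fill(text, width=30)
print(wrapped)
```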
The codecs module provides access to hundreds of text encodings. This includes the vast array of Windows code pages and the Macintosh code pages. The most commonly used are the various Unicode schemes (utf-16 and utf-8). There are also a number of codecs for other translations between strings of text and arrays of bytes. These schemes include base-64, zip compression, bz2 compression, various quoting rules, and even the simple rot_13 substitution cipher.
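A few of these translations can be sketched as follows. This assumes Python 3, where text strings and byte strings are distinct types; the sample strings are our own.

```python
import codecs

# Unicode text encoded to bytes and decoded back again.
text = "caf\u00e9"
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

# The rot_13 codec translates text to text.
secret = codecs.encode("Hello", "rot_13")
plain = codecs.decode(secret, "rot_13")

# The base64 codec translates bytes to ASCII-safe bytes.
binary = b"binary \x00 data"
safe = codecs.encode(binary, "base64")
restored = codecs.decode(safe, "base64")

print(utf8_bytes, secret, safe)
```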
5. Data Types. The Data Types modules implement a number of widely-used data structures. These aren't as useful as sequences, dictionaries or strings -- which are built into the language. These data types include dates, general collections, arrays, and scheduled events. This group also includes modules for searching lists, copying structures or producing nicely formatted output for a complex structure.
The datetime module handles details of the
calendar, including dates and times. Additionally, the
time module provides some more basic
functions for time and date processing. We'll cover both modules
in detail in Chapter 32, Dates and Times: the time and
datetime Modules.
These modules mean that you never need to attempt your own calendrical calculations. One of the important lessons learned in the late 90's was that many programmers love to tackle calendrical calculations, but their efforts had to be tested and reworked prior to January 1, 2000, because of innumerable small problems.
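A minimal sketch of letting datetime do the arithmetic follows. The particular dates are chosen by us because they straddle the leap day and the rollover that caused so much rework.

```python
from datetime import date, timedelta

# February 2000 has a leap day, so this span is two days, not one.
leap_day_span = date(2000, 3, 1) - date(2000, 2, 28)

# Adding one day to the last day of 1999 rolls over correctly.
rollover = date(1999, 12, 31) + timedelta(days=1)

# weekday() counts from Monday as 0.
new_year_weekday = date(2000, 1, 1).weekday()

print(leap_day_span.days, rollover, new_year_weekday)
```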
The calendar module contains routines for displaying and working with the calendar. This can help you determine the day of the week on which a month starts and ends; it can count leap days in an interval of years, etc.
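These questions can be answered with the calendar module, as in this short sketch. The years are arbitrary examples.

```python
import calendar

# monthrange() returns the weekday of the first day of the month
# (Monday is 0) and the number of days in the month.
first_weekday, days_in_month = calendar.monthrange(2000, 2)

# leapdays() counts leap years from the first year up to,
# but not including, the second year.
leap_count = calendar.leapdays(1900, 2001)

print(first_weekday, days_in_month, leap_count)
print(calendar.isleap(1900), calendar.isleap(2000))
```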
The collections module contains two data types, and is likely to grow
with future releases of Python. One type is the
deque -- a "double-ended queue" -- that can
be used as a stack (LIFO) or queue (FIFO). The other class is a
specialized dictionary, defaultdict, which
can return a default value instead of raising an exception for
missing keys.
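A brief sketch of both types, using the collections module, might look like this. The sample data is invented.

```python
from collections import defaultdict, deque

# A deque can be consumed from either end.
queue = deque()
queue.append("a")
queue.append("b")
queue.append("c")
first_in = queue.popleft()   # FIFO: take the oldest item
last_in = queue.pop()        # LIFO: take the newest item

# A defaultdict supplies int() (that is, 0) for a missing key
# instead of raising KeyError.
counts = defaultdict(int)
for word in "the quick the lazy the".split():
    counts[word] += 1

print(first_in, last_in, dict(counts))
```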
The bisect module contains the
bisect function to search a sorted list for a
specific value. It also contains the insort
function to insert an item into a list maintaining the sorted
order. This module performs faster than simply appending values to
a list and calling the sort method of a list.
This module's source is instructive as a lesson in well-crafted
algorithms.
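The two functions can be sketched as follows, with a small list of made-up scores.

```python
import bisect

scores = [10, 20, 30, 40]

# bisect() reports where a value would go to keep the list sorted.
position = bisect.bisect(scores, 25)

# insort() inserts the value at that position.
bisect.insort(scores, 25)

print(position, scores)
```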
The array module gives you a
high-performance, highly compact collection of values. It isn't as
flexible as a list or a tuple, but it is fast and takes up
relatively little memory. This is helpful for processing media
like image or sound files.
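As a rough illustration, here is an array of signed 16-bit values of the kind found in sound data. The sample values are invented, and the tobytes and frombytes method names assume Python 3.

```python
from array import array

# Type code "h" means signed short; every item must fit that type.
samples = array("h", [0, 1200, -1200, 32767])

# The array can be turned into raw bytes and rebuilt from them.
raw = samples.tobytes()
restored = array("h")
restored.frombytes(raw)

print(samples.itemsize, len(raw), restored)
```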
The sched module contains the
definition for the scheduler class that
builds a simple task scheduler. When a scheduler is constructed, it
is given two user-supplied functions: one returns the
“time” and the other executes a “delay”
waiting for the time to arrive. For real-time scheduling, the
time module time and
sleep functions can be used. The scheduler
has a main loop that calls the supplied time function and compares
the current time with the time for scheduled tasks; it then calls
the supplied delay function for the difference in time. It runs
the scheduled task, and calls the delay function with a duration
of zero to release any resources.
Clearly, this simple algorithm is very versatile. By supplying custom time functions that work in minutes instead of seconds, and a delay function that does additional background processing while waiting for the scheduled time, a flexible task manager can be constructed.
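To make the role of the two user-supplied functions concrete, here is a sketch that replaces the real clock with a simulated one counting in minutes. The FakeClock class, the record function and the task names are all hypothetical; only sched.scheduler, enter and run come from the library.

```python
import sched


class FakeClock:
    """Hypothetical simulated clock that counts whole minutes."""

    def __init__(self):
        self.now = 0

    def time(self):
        # The scheduler calls this to learn the current "time".
        return self.now

    def delay(self, minutes):
        # Instead of sleeping, jump straight ahead by the requested amount.
        self.now += minutes


clock = FakeClock()
log = []


def record(task_name):
    # Remember when (in simulated minutes) each task actually ran.
    log.append((clock.time(), task_name))


scheduler = sched.scheduler(clock.time, clock.delay)
scheduler.enter(30, 1, record, argument=("backup",))
scheduler.enter(5, 1, record, argument=("rotate logs",))
scheduler.run()

print(log)
```

For real-time scheduling, time.time and time.sleep would take the place of clock.time and clock.delay.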
The copy module contains functions
for making copies of complex objects. This module contains a
function to make a shallow copy of an
object, where any objects contained within the parent are not
copied, but references are inserted in the parent. It also
contains a function to make a deep copy of
an object, where all objects contained within the parent object
are duplicated.
Note that Python's simple assignment only creates a variable which is a label (or reference) to an object, not a duplicate copy. This module is the easiest way to create an independent copy.
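The difference between assignment, a shallow copy and a deep copy can be seen in this small sketch. The dictionary contents are invented.

```python
import copy

original = {"name": "config", "paths": ["/tmp", "/var"]}

alias = original                  # another label for the same object
shallow = copy.copy(original)     # new dict, but it shares the inner list
deep = copy.deepcopy(original)    # new dict with its own inner list

# Change the list that lives inside the original.
original["paths"].append("/home")

print(len(alias["paths"]), len(shallow["paths"]), len(deep["paths"]))
```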
The pprint module contains some
useful functions like pprint.pprint for
printing easy-to-read representations of nested lists and
dictionaries. It also has a PrettyPrinter
class from which you can make subclasses to customize the way in
which lists or dictionaries or other objects are printed.
6. Numeric and Mathematical Modules. These modules include more specialized mathematical functions and some additional numeric data types.
The decimal module provides decimal-based arithmetic which correctly handles significant digits, rounding and other features common to currency amounts.
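A short sketch of a currency calculation follows. The price and tax rate are made-up numbers, and half-up rounding is only one of several rounding rules the module offers.

```python
from decimal import Decimal, ROUND_HALF_UP

price = Decimal("19.99")
rate = Decimal("0.0825")   # hypothetical tax rate

# quantize() rounds to the number of places shown in its argument.
tax = (price * rate).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
total = price + tax

# Decimal values avoid the surprises of binary floating point.
float_sum_is_exact = (0.1 + 0.2 == 0.3)
decimal_sum_is_exact = (Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))

print(tax, total, float_sum_is_exact, decimal_sum_is_exact)
```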
The math module was covered in the section called “The math Module”. It contains the math functions like
sine, cosine and square root.
The random module was covered in
the section called “The math Module”.
7. Internet Data Handling. The Internet Data Handling modules contain a number of handy algorithms. A great deal of data is defined by the Internet Requests for Comments (RFCs). Since these effectively standardize data on the Internet, it helps to have modules already in place to process this standardized data. Most of these modules are specialized, but a few have much wider application.
These modules all provide various kinds of conversions, escapes or quoting so that binary data can be manipulated as safe, universal ASCII text. The number of these modules reflects the number of different clever solutions to the problem of packing binary data into ordinary email messages.
8. Structured Markup Processing Tools. The following modules contain algorithms for working with structured markup: Standard Generalized Markup Language (SGML), Hypertext Markup Language (HTML) and Extensible Markup Language (XML). These modules simplify the parsing and analysis of complex documents. In addition to these modules, you may also need to use the csv module for processing files; that's in chapter 9, File Formats.
Ordinary HTML documents can be examined with the
htmllib module. This module is based on the
sgmllib module. The basic
HTMLParser class definition is a
superclass; you will typically override the various functions to
do the appropriate processing for your application.
One problem with parsing HTML is that browsers, in order to
conform with the applicable standards, must accept incorrect
HTML. This means that many web sites publish HTML which is
tolerated by browsers, but can't easily be parsed by
htmllib. When confronted with serious
horrors, consider downloading the Beautiful Soup module. This
handles erroneous HTML more gracefully than
htmllib.
The xml.sax and
xml.dom modules provide the classes
necessary to conveniently read and process XML documents. A SAX
parser separates the various types of content and passes a series
of events to the handler objects attached to the parser. A DOM parser
decomposes the document into the Document Object Model
(DOM).
The xml.dom module contains the
classes which define an XML document's structure. The
xml.dom.minidom module contains a parser
which creates a DOM object.
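The two styles of XML processing can be contrasted in a short sketch. The document text and the TitleHandler class are our own inventions for the illustration.

```python
import xml.sax
from xml.dom import minidom

XML_TEXT = "<library><book title='A'/><book title='B'/></library>"


class TitleHandler(xml.sax.ContentHandler):
    """Hypothetical SAX handler that collects book titles."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def startElement(self, name, attrs):
        # The parser calls this once for each opening tag it sees.
        if name == "book":
            self.titles.append(attrs.getValue("title"))


# SAX: the parser pushes events to the handler as it reads.
handler = TitleHandler()
xml.sax.parseString(XML_TEXT.encode("utf-8"), handler)

# DOM: the parser builds the whole document, which we then query.
document = minidom.parseString(XML_TEXT)
dom_titles = [node.getAttribute("title")
              for node in document.getElementsByTagName("book")]

print(handler.titles, dom_titles)
```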
Additionally, there is a Miscellaneous Module (in chapter 33) that goes along with these.
The formatter module can be used in
conjunction with the HTML and XML parsers. A formatter instance
depends on a writer instance that produces the final (formatted)
output. It can also be used on its own to format text in different
ways.
9. File Formats. These are modules for reading and writing files in a few of the amazing variety of file formats that are in common use. In addition to these common formats, modules in chapter 8, Structured Markup Processing Tools are also important.
The csv module helps you parse and
create Comma-Separated Value (CSV) data files.
This helps you exchange data with many desktop tools that produce
or consume CSV files. We'll look at this in the section called “Comma-Separated Values: The csv
Module”.
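As a quick preview, this sketch parses and creates CSV data held in memory. The data is invented, and the use of io.StringIO in place of a real file assumes Python 3.

```python
import csv
import io

# A quoted field may contain the delimiter; the reader handles this.
source = io.StringIO('name,qty\nwidget,3\n"gadget, large",5\n')
rows = list(csv.reader(source))

# The writer adds quotes only where they are needed.
target = io.StringIO(newline="")
csv.writer(target).writerow(["gadget, large", 5])
written = target.getvalue()

print(rows)
print(repr(written))
```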
Configuration files can take a number of forms. The simplest
approach is to use a Python module as the configuration for a
large, complex program. Sometimes configurations are encoded in
XML. Many Windows legacy programs use .INI
files. The ConfigParser module can gracefully parse these files. We'll
look at this in the section called “Property Files and Configuration (or .INI)
Files: The ConfigParser Module”.
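Here is a minimal sketch of reading .INI-style settings. It assumes Python 3, where the module name is spelled configparser in lowercase; the section and option names are hypothetical.

```python
import configparser

# Hypothetical configuration text in the .INI style.
INI_TEXT = """
[server]
host = localhost
port = 8080
"""

parser = configparser.ConfigParser()
parser.read_string(INI_TEXT)

host = parser.get("server", "host")
port = parser.getint("server", "port")   # converted to an integer

print(host, port)
```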
10. Cryptographic Services. These modules aren't specifically encryption modules. Many popular encryption algorithms are protected by patents. Often, encryption requires compiled modules for performance reasons. These modules compute secure digests of messages using a variety of algorithms.
These modules compute a secure hash or digest of a message to ensure that it was not tampered with. MD5, for example, is often used for validating that a downloaded file was received correctly and completely.
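The idea of a digest can be sketched as follows. We assume the hashlib module and the SHA-256 algorithm here, neither of which is named in the text above; the messages are invented.

```python
import hashlib

message = b"The quick brown fox"
digest = hashlib.sha256(message).hexdigest()

# The same message always produces the same digest.
repeat = hashlib.sha256(message).hexdigest()

# Changing a single character produces a different digest.
tampered = hashlib.sha256(b"The quick brown fix").hexdigest()

print(digest)
print(digest == repeat, digest == tampered)
```

To validate a download, you would compute the digest of the received bytes and compare it with the digest published by the sender.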
11. File and Directory Access. We'll look at many of these modules in Chapter 33, File Handling Modules. These are the modules which are essential for handling data files.
The os and
os.path modules are critical for creating
portable Python programs. The popular operating systems (Linux,
Windows and MacOS) each have different approaches to the common
services provided by an operating system. A Python program can
depend on os and
os.path modules behaving consistently in
all environments.
One of the most obvious differences among operating systems
is the way that files are named. In particular, the
path separator can be either the POSIX
standard /, or the Windows \.
Additionally, the Mac OS Classic mode can also use :.
Rather than make each program aware of the operating system rules
for path construction, Python provides the
os.path module to make all of the common
filename manipulations completely consistent.
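A brief sketch of operating-system-neutral path handling follows. The directory and file names are made up.

```python
import os.path

# join() uses whatever separator the current operating system expects.
path = os.path.join("data", "reports", "summary.txt")

# split() and splitext() take the path apart again.
directory, filename = os.path.split(path)
stem, extension = os.path.splitext(filename)

print(path, directory, filename, stem, extension)
```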
The fileinput module helps your
program process a large number of files smoothly and simply.
The glob and
fnmatch modules help a Windows program
handle wild-card file names in a manner consistent with other
operating systems.
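Wild-card matching can be sketched like this. The file names are invented, and a temporary directory stands in for a real working directory.

```python
import fnmatch
import glob
import os
import tempfile

names = ["report.txt", "data.csv", "notes.txt", "image.png"]

# fnmatch works on plain strings; no files are touched.
text_names = fnmatch.filter(names, "*.txt")

# glob applies the same kind of pattern to a real directory.
with tempfile.TemporaryDirectory() as workdir:
    for name in names:
        with open(os.path.join(workdir, name), "w") as handle:
            handle.write("")
    matches = glob.glob(os.path.join(workdir, "*.txt"))
    found = sorted(os.path.basename(match) for match in matches)

print(text_names, found)
```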
The shutil module provides shell-like
utilities for file copy, file rename, directory moves, etc. This
module lets you write short, effective Python programs that do
things that are typically done by shell scripts.
Why use Python instead of the shell? Python is far easier to read, far more efficient, and far better suited to writing moderately sophisticated programs. Using Python saves you from having to write long, painful shell scripts.
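As a rough illustration of shell-like work done with shutil, this sketch copies and renames a file inside a temporary directory. All of the file and directory names are hypothetical.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, "original.txt")
    with open(source, "w") as handle:
        handle.write("hello\n")

    backup_dir = os.path.join(workdir, "backup")
    os.mkdir(backup_dir)

    # Like "cp original.txt backup/"
    copied = shutil.copy(source, backup_dir)

    # Like "mv original.txt renamed.txt"
    shutil.move(source, os.path.join(workdir, "renamed.txt"))

    remaining = sorted(os.listdir(workdir))
    backed_up = os.listdir(backup_dir)

print(remaining, backed_up)
```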
12. Data Compression and Archiving. These modules handle the various file compression algorithms that are available. We'll look at these modules in Chapter 33, File Handling Modules.
The tarfile and zipfile modules create archive files, which contain a number of files that are bound together. The TAR format is not compressed, whereas the ZIP format is compressed. Often a TAR archive is compressed using GZIP to create a .tar.gz archive.
These modules implement different compression algorithms. They all have similar features to compress or uncompress files.
13. Data Persistence. There are several issues related to making objects persistent. In Chapter 9 of the Python Reference, there are several modules that help deal with files in various kinds of formats. We'll talk about these modules in detail in Chapter 34, File Formats: CSV, Tab, XML, Logs and Others.
There are several additional techniques for managing persistence. We can "pickle" or "shelve" an object. In this case, we don't define our file format in detail; instead, we leave it to Python to persist our objects.
We can map our objects to a relational database. In this case, we'll use the SQL language to define our storage, create and retrieve our objects.
The pickle and
shelve modules are used to create
persistent objects; objects that persist beyond the one-time
execution of a Python program. The pickle
module produces a serial text representation of any object,
however complex; this can reconstitute an object from its text
representation. The shelve module uses a
dbm database to store and retrieve objects.
The shelve module is not a complete
object-oriented database, as it lacks any transaction management
capabilities.
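A minimal sketch of pickling follows. The order dictionary is invented; note that under Python 3, which we assume here, the serialized form is a bytes object.

```python
import pickle

order = {"id": 42, "items": [("widget", 3), ("gadget", 1)]}

# dumps() produces the serial representation of the object.
serialized = pickle.dumps(order)

# loads() reconstitutes an equal, but separate, object.
restored = pickle.loads(serialized)

print(type(serialized).__name__, restored)
```

Unpickle only data that comes from a source you trust, since loading a pickle can execute code.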
The sqlite3 module provides access to the SQLite relational database. This database provides a significant subset of SQL language features, allowing us to build a relational database that's compatible with products like MySQL or Postgres.
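Here is a small sketch using the sqlite3 module with an in-memory database. The table and its rows are hypothetical.

```python
import sqlite3

# ":memory:" creates a temporary database that lives only in RAM.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE book (title TEXT, year INTEGER)")
connection.executemany(
    "INSERT INTO book VALUES (?, ?)",
    [("First Steps", 1999), ("Second Steps", 2005)],
)

# The ? placeholders keep values separate from the SQL text.
cursor = connection.execute(
    "SELECT title FROM book WHERE year > ? ORDER BY title", (2000,)
)
rows = cursor.fetchall()
connection.close()

print(rows)
```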
14. Generic Operating System Services. The following modules contain basic features that are common to all operating systems. Most of this commonality is achieved by using the C standard libraries. By using these modules, you can be assured that your Python application will be portable to almost any operating system.
These modules provide access to a number of operating system
features. The os module provides control
over Processes, Files and Directories. We'll look at
os and os.path in
the section called “The os Module” and the section called “The os.path Module”.
The time module provides basic
functions for time and date processing. Additionally
datetime handles details of the calendar
more gracefully than time does. We'll cover
both modules in detail in Chapter 32, Dates and Times: the time and
datetime Modules.
Having modules like datetime and
time means that you never need to attempt
your own calendrical calculations. One of the important lessons
learned in the late 90's was that many programmers love to tackle
calendrical calculations, but their efforts had to be tested and
reworked because of innumerable small problems.
A well-written program makes use of the command-line
interface. It is configured through options and arguments, as well
as properties files. We'll cover the
getopt, optparse and
glob modules in Chapter 35, Programs: Standing Alone.
Often, you want a simple, standardized log for errors as
well as debugging information. We'll look at logging in detail in
the section called “Log Files: The logging Module”.
18. Internet Protocols and Support. The following modules contain algorithms for responding to several of the most common Internet protocols. These modules greatly simplify developing applications based on these protocols.
The cgi module is used for web server
applications invoked as CGI scripts. This allows you to put Python
programming in the cgi-bin
directory. When the web server invokes the CGI script, the Python
interpreter is started and the Python script is executed.
These modules allow you to write relatively simple
application programs which open a URL as if it were a standard
Python file. The content can be read and perhaps parsed with the
HTML or XML parser modules, described above. The
urllib module depends on the
httplib, ftplib and
gopherlib modules. It will also open local
files when the scheme of the URL is file:. The
urlparse module includes the functions
necessary to parse or assemble URLs. The
urllib2 module handles more complex
situations where there is authentication or cookies
involved.
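Parsing and reassembling a URL can be sketched as follows. This assumes Python 3, where the urlparse functions live in urllib.parse; the URL itself is a made-up example and nothing is fetched from the network.

```python
from urllib.parse import urlparse, urlunparse

url = "http://www.example.com:8080/docs/index.html?lang=en#top"

# urlparse() splits the URL into its named components.
parts = urlparse(url)
print(parts.scheme, parts.hostname, parts.port, parts.path,
      parts.query, parts.fragment)

# urlunparse() assembles the components back into a string.
rebuilt = urlunparse(parts)
print(rebuilt)
```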
The httplib,
ftplib and gopherlib
modules include relatively complete support for building client
applications that use these protocols. Between the
htmllib module and
httplib module, a simple character-oriented
web browser or web content crawler can be built.
The poplib and
imaplib modules allow you to build mail
reader client applications. The poplib
module is for mail clients using the Post-Office Protocol, POP3
(RFC 1725), to extract mail from a mail server. The
imaplib module is for mail clients using
the Internet Message Access Protocol, IMAP4 (RFC 2060) to manage
mail on an IMAP server.
The nntplib module allows you to
build a network news reader. The newsgroups, like
comp.lang.python, are processed by NNTP
servers. You can build special-purpose news readers with this
module.
The SocketServer module provides the
relatively advanced programming required to create TCP/IP or
UDP/IP server applications. This is typically the core of a
stand-alone application server.
The SimpleHTTPServer and
CGIHTTPServer modules rely on the basic
BaseHTTPServer and
SocketServer modules to create a web
server. The SimpleHTTPServer module
provides the programming to handle basic URL requests. The
CGIHTTPServer module adds the capability
for running CGI scripts; it does this with the
fork and exec functions
of the os module, which are not necessarily
supported on all platforms.
The asyncore (and
asynchat) modules help to build a
time-sharing application server. When client requests can be
handled quickly by the server, complex multi-threading and
multi-processing aren't really necessary. Instead, this module
simply dispatches each client communication to an appropriate
handler function.
22. Program Frameworks. We'll talk about a number of program-related issues in Chapter 35, Programs: Standing Alone and Chapter 36, Programs: Clients, Servers, the Internet and the World Wide Web. Much of this goes beyond the standard Python library. Within the library are two modules that can help you create large, sophisticated command-line application programs.
The cmd module contains a superclass
useful for building the main command-reading loop of an
interactive program. The standard features include printing a
prompt, reading commands, providing help and providing a command
history buffer. A subclass is expected to provide functions with
names of the form do_command. When the user
enters a line beginning with command, the
appropriate do_command function is
called.
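The do_command convention can be sketched with a tiny subclass. The Calculator class and its add and quit commands are hypothetical; we call onecmd directly so the example runs without waiting for keyboard input.

```python
import cmd


class Calculator(cmd.Cmd):
    """Hypothetical interactive calculator built on cmd.Cmd."""

    prompt = "calc> "
    last_result = None

    def do_add(self, line):
        """add N N ...: print the sum of the numbers."""
        self.last_result = sum(float(word) for word in line.split())
        print(self.last_result)

    def do_quit(self, line):
        """quit: leave the command loop."""
        return True   # a true value tells the loop to stop


calculator = Calculator()

# onecmd() handles one line exactly as the command loop would.
calculator.onecmd("add 1 2 3.5")
stop = calculator.onecmd("quit")
```

Calling calculator.cmdloop() instead would print the prompt and read commands until quit is entered.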
The shlex module can be used to
tokenize input in a simple language similar to the Linux shell
languages. This module defines a basic
shlex class with parsing methods that can
separate words, quoted strings and comments, and return them to
the requesting program.
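A short sketch of shell-style tokenizing follows, using the module's split function. The command line is invented.

```python
import shlex

line = 'copy "my file.txt" /backup  # trailing comment'

# Quoted text stays together as one token; with comments=True,
# everything after the # is dropped.
tokens = shlex.split(line, comments=True)
print(tokens)
```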
26. Python Runtime Services. The Python Runtime Services modules are considered to support
the Python runtime environment. These can be divided into two groups:
those that are an interface into the Python interpreter, and those
that are generally useful for programming. The interpreter interface
allows us to peer under the hood at how Python works internally. The
programming category is more generally useful, and includes
sys, pickle, and
shelve.
The sys module contains execution
context information. It has the command-line arguments (in
sys.argv) used to start the Python interpreter.
It has the standard input, output and error file definitions. It
has functions for retrieving exception information. It defines the
platform, byte order, module search path and other basic facts.
This is typically used by a main program to get run-time
environment information.