Web Data Mining

One of my articles on Web Data Mining appeared in i.t.magazine. They were kind enough to permit me to make it available from my blog.

Almost all of us need information, and a lot of it is freely available on the Web. Learning a few techniques for mining that information is a useful skill. Here are some sample usage scenarios:

  • You are an entrepreneur who is planning to start a new software business. You hear that Web 2.0 and social applications are hot. You want to do some research to understand the marketplace, and want to prototype a few product ideas.
  • You are part of the CTO office of a software company, and are interested in short-, medium-, and long-term technology and business trends in your industry. You need this information to build skills in your organization, and to build a few concept prototypes.
  • You are part of the CIO office of an organization. You need to balance early adoption of technologies with providing a stable environment for your business; you don’t want to jump at every new technology. In addition to finding new tools and techniques, you also want to understand the risks and maturity level of these technologies, know which ones are being used for building applications, and track many non-technical factors.
  • You are an outsourcing company and want to find customers for your business and track trends in outsourcing. Being a jump ahead of your competition and carving a niche are important differentiators.
  • You are part of HR, or a Learning Officer, and need to plan for the skill development of your employees. You want to keep your software team happy and so need to know the latest technologies, tools and resources to plan training and skill development.
  • You are a development lead, and need to provide the team with the latest information on product releases, and access to product/technology knowledge bases. You need to know of any problems, including security issues, in the tools or software that you are currently using for your projects.

Broadly, there are three components to finding, using and sharing information:

  • Identifying and discovering information sources
  • Tracking information from various sources and filtering it for relevance to your needs
  • Organizing collected information and sharing it with others
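
To make the tracking, filtering and organizing steps a little more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the Item structure, the function names, the sample items and the simple keyword-count notion of relevance are all hypothetical, and a real setup would pull items from feeds or searches rather than a hard-coded list.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Item:
    """One piece of collected information (hypothetical structure)."""
    source: str   # kind of source, e.g. "news", "blog", "wiki"
    title: str
    summary: str


def relevance(item, keywords):
    """Count how many of the keywords appear in the title or summary."""
    text = f"{item.title} {item.summary}".lower()
    return sum(1 for kw in keywords if kw.lower() in text)


def filter_relevant(items, keywords, min_hits=1):
    """Keep items matching at least min_hits keywords, most relevant first."""
    scored = [(relevance(it, keywords), it) for it in items]
    kept = [pair for pair in scored if pair[0] >= min_hits]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [it for _, it in kept]


def organize_by_source(items):
    """Group item titles by the kind of source they came from."""
    grouped = defaultdict(list)
    for it in items:
        grouped[it.source].append(it.title)
    return dict(grouped)


# Made-up sample items standing in for tracked information.
items = [
    Item("news", "Web 2.0 startups raise funding",
         "Social applications attract investors."),
    Item("blog", "Gardening tips", "How to grow tomatoes."),
    Item("wiki", "Outsourcing trends",
         "Notes on Web 2.0 adoption in outsourcing."),
]
keywords = ["web 2.0", "social"]

relevant = filter_relevant(items, keywords)
grouped = organize_by_source(relevant)
```

Counting keyword hits is only the crudest possible filter, but it shows the shape of the pipeline: collect items, score them against your needs, then group what survives so it can be shared.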

Information sources are many. The typical ones can be categorized as:

  • News sources
  • Company websites
  • Blogs
  • Search engines
  • Wikis
  • Discussion groups
  • Social bookmarking sites
  • Social networks

web-information-sources.jpg

This article (webdata-mining.pdf) describes these sources and their significance in more detail. The article uses British spelling, which is common in India.

Web Information Sources

Here is the mind map of various web information sources. This is not an exhaustive list. I will have a few posts following that describe each one of these in more detail.

web-information-sources.jpg

Look at this entry for some contextual information.

Update Jul 1, 2009

A whole host of new sources has appeared, so I will add them in the comments and try to update this mind map once in a while.

Here are some:

  • Freebase is a social database of open data.
  • Twine is a smart way to keep track of information and share it with others. It goes beyond simple bookmarking.
  • data.gov is a fabulous source of US government information. I will try to find and add similar resources for other governments.