The SQL Server Team (@SQLServer) announced Apache Hadoop Services for Windows Azure, a.k.a. Apache Hadoop on Windows Azure and Hadoop on Azure, at the Professional Association for SQL Server (PASS) Summit in October 2011.
Val Fontama’s Availability of Community Technology Preview (CTP) of Hadoop based Service on Windows Azure post of 12/14/2011 described how to obtain an invitation to the CTP:
In October at the PASS Summit 2011, Microsoft announced expanded investments in “Big Data”, including a new Apache Hadoop™ based distribution for Windows Server and service for Windows Azure. In doing so, we extended Microsoft’s leadership in BI and Data Warehousing, enabling our customers to glean and manage insights for any data, any size, anywhere. We delivered on our promise this past Monday, when we announced the release of the Community Technology Preview (CTP) of our Hadoop based service for Windows Azure.
Today this preview is available to an initial set of customers. Those interested in joining the preview may request to do so by filling out this survey. Microsoft will issue a code that will be used by the selected customers to access the Hadoop based Service. We look forward to making it available to the general public in early 2012. Customers will gain the following benefits from this preview:
- Breakthrough insights through integration with Microsoft Excel and BI tools. This preview ships with a new Hive Add-in for Excel that enables users to interact with data in Hadoop from Excel. With the Hive Add-in, customers can issue Hive queries to pull and analyze unstructured data from Hadoop in the familiar Excel environment. In addition, the preview includes a Hive ODBC Driver that integrates Hadoop with Microsoft BI tools. This driver enables customers to integrate and analyze unstructured data from Hadoop using award-winning Microsoft BI tools such as PowerPivot and Power View. As a result, customers can gain insight into all their data, including unstructured data stored in Hadoop.
- Elasticity, thanks to Windows Azure. This preview of the Hadoop based service runs on Windows Azure, offering an elastic and scalable platform for distributed storage and compute.
We look forward to your feedback! Learn more at www.microsoft.com/bigdata.
Senior Product Manager
SQL Server Product Management
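As a rough illustration of how the Hive ODBC Driver described in the announcement might be used outside of Excel, the following Python sketch builds an ODBC connection string and a HiveQL query. The DSN, host, port, and `weblogs` table are hypothetical placeholders, not values documented by the CTP; the Excel Hive Add-in issues equivalent HiveQL behind the scenes:

```python
# Hypothetical sketch of querying Hive on a Hadoop on Azure cluster through
# the Hive ODBC Driver. DSN, host, port, and table names are placeholders.
CONN_STR = "DSN=HadoopOnAzure;HOST=hadoop1.cloudapp.net;PORT=10000"

# A HiveQL aggregate over a hypothetical table of unstructured log records.
HIVEQL = (
    "SELECT level, COUNT(*) AS hits "
    "FROM weblogs "
    "GROUP BY level "
    "ORDER BY hits DESC"
)

def run_query(conn_str=CONN_STR, sql=HIVEQL):
    """Execute the query over ODBC; requires the Hive ODBC driver installed."""
    import pyodbc  # third-party package; used only when actually connecting
    with pyodbc.connect(conn_str, autocommit=True) as conn:
        return conn.cursor().execute(sql).fetchall()
```

Tools such as PowerPivot consume the same driver, so the result set that `run_query` would return is the same data those BI tools surface.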
Following is a step-by-step tutorial for running the first process of the 10GB GraySort sample project:
1. After you receive your invitation code, navigate to https://www.hadooponazure.com/ and log in with your Windows Live ID and invitation code to open the Account page with the Request a New Cluster content active. Type a globally unique DNS Name for your cluster, hadoop1 for this example, select a Cluster Size (Large for this example), and type an administrative Username, Password and password confirmation:
Note: There is no charge for Windows Azure resources used during the CTP, so you don’t need to provide a credit card to create your cluster.
2. When your cluster is provisioned, the Account page’s content changes to include tiles to create a new job as well as access your cluster by different methods:
Note: You must renew your cluster every three days.
3. Click the Samples tile to open the Account/Samples page, which describes the currently available samples.
4. The GraySort MapReduce sample is a useful starting point because it runs in a reasonably short time (about 4 minutes with a Large cluster), so click the 10GB GraySort tile to open its Account/SampleDetails page, which describes the sample:
5. Click the Deploy to Your Cluster button to automatically populate text boxes with values for the TeraGen program, which generates 10 GB of data:
Note: If you have tried SQL Azure Labs’ Microsoft Codename “Cloud Numerics” CTP, you’ll notice that the process for creating the Hadoop cluster and executing the first MapReduce job is much more automated than that described in my Deploying “Cloud Numerics” Sample Applications to … post of 1/28/2012 (updated 1/30/2012).
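The portal fills in the TeraGen parameters for you, but the arithmetic behind them is easy to check: TeraGen writes fixed 100-byte rows, so a 10 GB data set requires 10^8 rows (assuming 10 GB means 10^10 bytes, as the Hadoop examples do). The sketch below verifies the row count and shows what the equivalent command line might look like; the jar and output-path names are illustrative, not copied from the portal:

```python
# Back-of-the-envelope check of the TeraGen parameters the portal populates.
ROW_BYTES = 100            # TeraGen/TeraSort rows are a fixed 100 bytes
TARGET_BYTES = 10 * 10**9  # 10 GB target data set

rows = TARGET_BYTES // ROW_BYTES  # rows to pass as TeraGen's first argument
print(rows)  # 100000000

# Illustrative command line (jar and path names are assumptions):
teragen_cmd = (
    f"hadoop jar hadoop-examples.jar teragen {rows} "
    "/example/data/10GB-sort-input"
)
```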
6. Click the Execute Job button to run the TeraGen program, which initially displays this Job Info page:
7. After a few seconds, the program begins adding lines of Debug Output for the 50 maps in increments close to 1 percent:
Note: Hadoop automatically recovers from the failures reported above, but it’s surprising that lines for 78 and 79 percent are missing.
8. When processing completes, click the left-arrow button to return to the Account page with a tile for the TeraGen process added:
9. Click the Job History tile to display a summary of the preceding operation, which confirms successful completion with an Exit Code = Ok cell:
10. Click the left-arrow button to return to the Account page and click the Manage Cluster tile to display total storage used (30 GB) and other data-source options (Data Market, Windows Azure blob storage, and Amazon S3):
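The 50 maps reported in the Debug Output above divide the 10 GB of TeraGen data evenly, which works out to 200 MB per map. This sketch makes the split explicit (it assumes TeraGen's fixed 100-byte rows and an even distribution of rows across maps):

```python
# Rough arithmetic behind the 50-map TeraGen run described above.
TOTAL_ROWS = 100_000_000  # 10 GB at 100 bytes per row
MAPS = 50

rows_per_map = TOTAL_ROWS // MAPS   # rows each map task generates
bytes_per_map = rows_per_map * 100  # data each map task writes
share = 100 / MAPS                  # each map's share of job progress, in percent

print(rows_per_map, bytes_per_map, share)  # 2000000 200000000 2.0
```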
Stay tuned for more details about the second and third MapReduce operations.
Apache Hadoop on Windows Azure Resources
Wesley McSwain posted an Apache Hadoop Based Services for Windows Azure How To Guide to the TechNet wiki on 12/13/2011. The latest update when this post was written was 1/18/2012. Here’s its content:
This content is a work in progress for the benefit of the Hadoop Community. Please feel free to contribute to this wiki page based on your expertise and experience with Hadoop.
If you have any questions, please use the group’s distribution list: http://tech.groups.yahoo.com/group/hadooponazurectp/
Table of Contents
- Setup your Hadoop cluster
- Running Sample Jobs
- Writing your own Job and running on Cluster
- Job Administration
- Interactive Console:
- Tasks with Hive on the Interactive Console
- Remote Desktop
- Connecting Windows Azure Blob Storage from Hadoop Cluster
- Open Ports
- Manage Data
- Import Data from Data Market
- Setup ASV – Use your Windows Azure Blob Storage Account
- Setup S3 – Use your Amazon S3 account
- Apache Hadoop on Windows Azure:
Below are some blogs to follow on Hadoop on Azure [links added]:
- Alexander Stojanovic (Founder and [General Manager] of Hadoop on Azure and Windows), @stojanovic, http://conceptualorigami.blogspot.com/
- Dave Vronay, @davevr, http://dvronay.blogspot.com/2011/12/design-of-portal-for-hadooponazurecom.html
- Brad Sarsfield, @bradoop
- Denny Lee, @dennylee, http://dennyglee.com/
- Avkash Chauhan, @avkashchauhan, http://blogs.msdn.com/b/avkashchauhan/