SQLServerBrains

Friday 24 August 2012

Business Intelligence : New capabilities delivered for end users in Excel 2013

Ability to analyze data ranging from a few rows to hundreds of millions of rows with extreme analytical performance on your desktop using Excel data models powered by xVelocity in-memory analytics engine
Opportunity to speed up analysis in Excel by easily cleaning up and shaping your data with Flash Fill; by previewing, applying conditional formatting, creating tables, charts and PivotTables using Quick Analysis; by easily navigating multidimensional and tabular data models and creating trend charts to analyze information over time using Quick Explore.

Quick Explore enables a user to look at the product categories summarized by Sales Quantity with 1 click:

Quick Analysis suggests a Pivot Table type with a preview of actual data:

Empower users of all levels to access, mash-up and analyze data from virtually any source and rapidly create compelling analytical applications with PowerPivot.

PowerPivot Diagram View shows the data relationship:

Provide stunning data visualization to discover new insights at the speed of thought with a highly interactive and familiar data exploration, visualization, and presentation experience for users of all levels with Power View.

Interactive visualization with Power View in Excel 2013:

New capabilities delivered for IT:

At the same time, we are strengthening our IT management and oversight capability by

Enabling discovery and assessment of user-created spreadsheets with SharePoint 2013
Enabling comparing spreadsheets, tracking lineage, conducting interactive diagnostics and creating spreadsheet analysis reports with Inquire in Excel

SharePoint 2013 discovery and assessment criteria setup:

Inquire in Excel showing spreadsheet lineage and workbook analysis:

Wednesday 15 August 2012

Power Pivot with Excel 2010 - Features and Getting Started

PowerPivot for Excel is an add-on for Excel 2010. PowerPivot allows users to conduct powerful business intellegence (BI) in an environment that is familier. It's a free download from Microsoft and allows users to work with extremely large data sets. Before PowerPivot, this kind of analysis was limited to enterprise BI tools such as SAS and Business Objects.

PowerPivot uses an in-memory engine called VertiPaq. This new SSAS engine takes advantage of the increased RAM available in most personal computers today. Most IT shops are challenged with the resources needed to build out an enterprise BI environment. PowerPivot moves some of this work closer to the business user. While there are many features in PowerPivot for Excel.

Work With Very Large Data Sets

If you open Microsoft Excel and move to the very bottom of a worksheet, you will see that the maximum number of rows is 1,048,576. This represents about a million rows of data. With PowerPivot for Excel, there is no limit on the number of rows of data. While this is a true statement, the actual limitation is based on the version of Microsoft Excel you are running and whether you are going to publish your spreadsheet to SharePoint 2010.

If you are running the 64-bit version of Excel, PowerPivot can reportedly handle about 2 GB of data. You must have enough RAM to make this work. If you plan to publish your PowerPivot based Excel spreadsheet to SharePoint 2010, the maximize file size is also 2 GB. The bottom line is PowerPivot for Excel can handle millions of records. If you hit the maximum, you will receive a memory error.

Combine Data From Different Sources

This has to be one of the most important features in PowerPivot for Excel. Excel has always been able to handle different data sources such as SQL Server, XML, Microsoft Access and even web based data. The problem comes when you need to created relationships between different data sources. There are 3rd party products available to help with this, and you can use Excel functions like VLOOKUP to "join" data. But these methods are impractical for large data sets. PowerPivot for Excel is built to accomplish this task.

Within PowerPivot, you can import data from virtually any data source. I have found that one of the most useful data sources is a SharePoint List. I have used PowerPivot for Excel to combine data from SQL Server and a list from SharePoint. You will need SharePoint 2010 to make this work along with the ADO.Net runtime installed on the SharePoint environment.

When you connect PowerPivot to a SharePoint list, you are actually connecting to a Data Feed. To create a Data Feed from a SharePoint list, open the list and click on the List ribbon. Then click on Export as Data Feed and save it. The feed is available as a URL in PowerPivot for Excel.

There is one challenge that will be resolved by Microsoft soon. PowerPivot for Excel only supports inner joins. If you need to do an outer join, you will need to wait until an update is released from Microsoft or tweak your data so that an inner join can be used.

Create Visually Apealing Analytical Models

PowerPivot for Excel allows you to output a variety of visual data to your Excel worksheet. You can return data in a PivotTable, PivotChart, Chart and Table (horizontal and vertical), Two Charts (horizontal and vertical), Four Charts and a Flattened PivotTable. The power comes when you create a worksheet that includes multiple outputs. This provides a dashboard view of the data that makes analysis really easy. Even your executives should be able to interact with your worksheet if you build it correctly. Slicers, which shipped with Excel 2010, makes it simple to visually filter data.

Use DAX to Create Calculated Fields for Slicing and Dicing Data

DAX stands for Data Analysis Expressions and is the formula language used in PowerPivot tables. There are similarities between Excel formulas and DAX. The DAX formula language is used primarily in creating calculated columns. Check out the DAX Reference in TechNet for a complete reference.

In a regular Pivot Table in Excel that included a properly formated date field, you can use grouping to include the ability to filter or group by year, quarter, month and day. In PowerPivot, you need to create these as calculated columns to accomplish the same thing. Add a column for each way you need to filter or group data in your Pivot Table. Many of the date functions in DAX are the same as Excel formulas which makes this a snap.

For example, use =YEAR([date column]) in a new calculated column to add the year to your data set in PowerPivot. You can then use this new YEAR field as a slicer or group by in your Pivot Table.

Publish Dashboards to SharePoint 2010

If your company is like mine, dashboard are still the work of your IT team. PowerPivot when combined with SharePoint 2010 puts the power of dashboards into the hands of your users. One of the prerequisites of publishing PowerPivot driven charts and tables to SharePoint 2010 is the implementation of PowerPivot for SharePoint on your SharePoint 2010 farm

Installing PowerPivot

The PowerPivot plugin for Excel can be downloaded from the official PowerPivot site: www.powerpivot.com.

PowerPivot can be installed on both x86 and x64 systems including Windows XP sp3, Windows Vista sp1, and Windows 7. If you aren’t installing it on Windows 7, you will need to install .Net Framework 3.5 SP1.

A minimum of 1G of memory is required to run PowerPivot, but 2G is recommended. Depending on the solution, more memory for PowerPivot can increase the performance of the transformations due to its in-memory processing capabilities.

Also on the PowerPivot site is a link to a trial for Microsoft Office 2010 Professional Plus if you don’t already have it installed. If you customize the installation of the Office Suite, you need to choose at least Excel and the Office Shared Tools.

Once Microsoft Office is installed, you can download the PowerPivot install from the download page on the PowerPivot site. Locate and click on the PowerPivot install file. The add-in will actually install the next time you run Excel, at which time you will be asked to approve the installation of the PowerPivot add-in.

Creating Your First PowerPivot

Now that the PowerPivot add-in is installed, you can see that you have a new PowerPivot tab across the top of Excel. Clicking on that tab displays the PowerPivot ribbon. To select the data to use for your PowerPivot, you will need to click on the PowerPivot Window button (highlighted on the left below).

On the PowerPivot window, select Get External Data from Database -> From SQL Server. Notice that you can also get data from many other sources including data feeds, text files, and the web. In this case, however, we are going to use the AdventureWorks2008DW sample database.

In the Table Import Wizard, specify localhost for the server and AdventureWorks2008DW as the Database name. If you need to install AdventureWorks2008DW, you can find it here.

Hit Next and make sure to select “Select from a list of tables and views to choose the data to import” from the next screen.

From the table list, select the FactResellerSales table and rename it to ResellerSales in the Friendly Name column.

While you are there, click the Select Related Tables button. This selects the 7 tables in the database that have a relationship with the FactResellerSales table. Click Preview & Filter just to have a look; we won’t change anything here, though you could use it to filter out rows and columns from the model.

Click Cancel to get out of Preview & Filter and then hit Finish on the Table Import Wizard screen. This brings up the following window which displays the progress on each table import.

A Details link displays in the message column on each import line if there is information or errors regarding the import. Notice there is a detail link on final step, data preparation. Clicking that displays a popup window with information about the data preparation. Most of it is successful, but it does inform you of a few errors. For instance, self-joins are not supported in the case of employee. Fortunately, none of these affect the pivot we are creating.

Closing the Table Import Wizard reveals the model from which we will work. First, let’s tweak a few things. You see each imported table is represented on its own tab in the model. Let’s select the ResellerSales tab.

Scroll all the way to the right and click on “Add Column” in the header. For the expression, subtract TotalProductCost from SalesAmount. You can either type in the expression as it appears below, or build it as you click the columns.

Now we need to give our new computed column a better name than ComputedColumn1. Right-click on the column header and select “Rename Column” from the drop down menu. Change the name to “Profit”.

To get started on the Pivot Table, we need to switch back to the workbook itself. You can either just click on the spreadsheet in the background or you can use the toolbar button that looks like the Excel icon (next to Formatting) to switch back.

Once back in the workbook, select the PivotTable button and choose Chart and Table (Horizontal) from the menu. This allows you to create both a chart and table for your data contained on one table. The chart will be on the left and data will be on the right.

When asked for a decision on placing it in a new or existing workbook, choose the existing workbook.

The Field List panel on your right controls the contents of the pivot. Let’s begin by dragging the fields we are interested in from the list into their respective areas. First drag the amount fields for TotalProcuctCost, SalesAmount, and Profit from the ResellerSales table down to the Values list. Drag the SalesTerritoryRegion from DimSalesTerritory to the Row Labels list. Similarly, drag the ResellerName column from the DimReseller table down to the Row Labels list and drop it in below the SalesTerritoryRegion column.

Now that the report is filled in, we’ll clean it up a little. Change the header “Row Labels” to “Region\Reseller”. Change the headers over the three amount fields to remove “Sum of” from the name. Your pivot report and Field List panel should resemble the following screen shot.

It’s starting to look good, but the chart is way too busy. Maybe we should just look at one region at a time. So let’s add a horizontal slicer. From the Field List, choose SalesTerritoryRegion again and drag it to the Slicers Horizontal list. Now you see a button for each region displayed at the top of the report. Click on France. You’ll see the report and chart filter down to display just the region of France and its resellers.

Let’s further narrow down the chart by doing a manual filter using the filter button on the chart for ResellerName. From the list, first click Select All to unselect all the resellers, then pick just Accessories Network, Ace Bicycle Supply, and Atypical Bike Company. This reduces the Resellers displayed on the chart to just these three. Note that if we would have selected resellers not from France (our existing Region filter), the chart would be empty.

The resulting chart is much easier to read having just the three resellers of interest graphed.

We are not quite done. It’s bothering me that our amounts have more than 2 decimal places, so let’s take care of that real quick. Select all the data for the three amount columns and when the formatting menu appears, select the $ icon as shown below.

And now, the final product. Congratulations on completing your first PowerPivot report.

Friday 10 August 2012

Views and its types in SQL Server 2008

A view contains rows and columns, just like a real table. The fields in a view are fields from one or more real tables in the database.
You can add SQL functions, WHERE, and JOIN statements to a view and present the data as if the data were coming from one single table.

SQL Server 2008 organizes user-defined views into three major categories:

Standard Views: Most of the time, we create this kind of view where we use single or more that one tables, use JOINs and other clauses and pull the data out. The tables are saved on a single physical file, most of the time on primary data file – that is .mdf file. All tables don’t have any kind of partitions – I’ll discuss partitions in Partitions Views section below. So, most of views that we write come under this category.
Indexed Views: Usually views pull the data out from underlying tables and utilize indexes created on table columns. However, an indexed view allows us to create a unique clustered index on it – allowing us to physically keep a separate copy of sorted data based on which column we created a unique clusetered index. This dramatically improves the view’s performance for some types of queries. An indexed view works best for queries that are giving summary data. However, this type of view is not recommended in case there is high frequencyof data insert, update or delete on underlying table(s).
Partitioned Views:This type of view joins horizontally partitioned data from a set of members table across one or more servers. We partition a table horizontally and keep data partitions either on one server or on multiple servers (federated servers) in order to gain performance. We may sub-categorize this category into two:

Local Partitioned Views: This joins required tables on the same instance of SQL Server ie. same SQL Server engine.
Distributed Partitioned Views: This joins required tables across servers i.e. multiple SQL Server engines.

Standard Views:

In SQL Server a view represents a virtual table. You can say it is a subset of a table. Just like a real table, a view consists of rows with columns, and we can retrieve data from a view (sometimes even update data in a view). The fields in the view's virtual table are the fields of one or more real tables in the database.

Creating a view, by example

The following is an example of a simple SQL Server view using the Student table. Note: You may have to update some of the rows to add data to see the effects of this view.

IF ( OBJECT_ID('dbo.vw_students1') IS NOT NULL ) 
   DROP VIEW dbo.vw_students1 
GO

CREATE VIEW dbo.vw_students1 
AS
  SELECT 
    lastname    , 
    firstname   , 
    (firstname + ' ' + lastname) as "Fullname_fl",
    (lastname + ', ' + firstname) as "Fullname_lf",
    birth_dttm,  
    DATEDIFF(yy, birth_dttm, GETDATE()) - 
    CASE 
       WHEN MONTH(birth_dttm) > MONTH(GETDATE()) 
            OR 
            (MONTH(birth_dttm) = MONTH(GETDATE()) 
             AND DAY(birth_dttm) > DAY(GETDATE())
            ) 
    THEN 1 ELSE 0 
    END as "Age"  
    FROM   Students        
GO

In general, you should adopt some naming standard for your views. This standard is vw_<name of view>. The name should be somewhat reflective of the purpose of the view. You can clearly see that I did not do such a good job with this. Sometimes coming up with a short descriptive name for your view is easier said than done.

The syntax for creating a view is...

CREATE OR REPLACE VIEW `<your_view_name>`

...followed by a normal SQL SELECT. This SELECT can include a WHERE clause or anything else for that matter that can be put into a SELECT statement. The scenarios are endless. It really depends on the purpose of the view.

As you can see in our view we are formatting the first and last name. This is a pretty common thing to do, By having a view that already does this we save having to write that function in every query where this is a requirement. You can also see that we have take the birth date column and calculated age.

Executing a View

Execute an SQL View

The example below shows all of the code from the view. You could also do a SELECT *, or further restrict the columns you want to see. You can also add additional row restriction to the view as we have done.

SELECT  TOP 100
        lastname     , 
        firstname    , 
        Fullname_fl  , 
        Fullname_lf  , 
        birth_dttm   , 
        Age         
FROM    dbo.vw_students1 (NOLOCK) 
WHERE Age is not null
GO

Indexed View


On the surface, an indexed view is the same thing as a normal view, however the critical difference is that an indexed view actually allows you to create a clustered index on it, effectively working around the "one clustered index per table" limitation. The downside to this is that the indexed view is a full copy of the data from the underlying tables, so for space considerations you will need to take this into account when using them.

Let's look first at how to create a normal view, so that we have something to compare to as we create an indexed view

create view ContactsView as 
( 
      select [Users].[FirstName], [Users].[LastName], [Contacts].* 
        from [Users] 
       inner join [Contacts] on [Contacts].[UserID] = [Users].[UserId] 
);

What this will do is create a pretty flexible view that will only pull the first and last name of the user from the users table, and all of the columns from the Contacts table and automatically pick up any new columns that are added to the Contacts table, with no further modifications.  Indexed views require a more rigid definition of what they will return.  For this reason, we cannot use a wildcard (*) in the underlying query.  Each column must be individually called out; also, each table named in the query must be in 2 part dot notation, referencing both the schema of the table and the table name itself, and the view must be declared with schemabinding enabled for it.





 Below is the modified query to create a view capable of being an indexed view:


create view [dbo].[OrganizationContactsView] with schemabinding as 
( 
      select [FirstName], [LastName], [Contacts].[UserID],[Contacts].[ContactDescription], [Contacts].[IsPublic] 
        from [dbo].[Contacts] 
       inner join [dbo].[Users] on [Contacts].[UserID] = [Users].[UserId] 
 );

 Now, we need to create the clustered index on it.  Clustered indexes on a view must be unique, otherwise you will not be able to create one; nonclustered indexes will work fine.  For the present example, a UserId will only be in the contacts table once per Organization, so we can make a unique clustered index spanning the OrgId and UserId columns, like this:




create unique clustered index IX_OrganizationContactsView_OrgId on[OrganizationContactsView](OrgId,UserId)




Partitioned views




Partitioned views are used to access data that has been horizontally split, or partitioned, across multiple tables. These tables can be in the same or different databases—or even spread across multiple servers. Partitioning of tables is done to spread the I/O and processing load of large tables across multiple disks or servers.




You combine the tables in a partitioned view by using a UNION ALL statement that causes the data from the separate tables to appear as if they were one table. These separate tables are referred to as member tables or base tables. The member tables in a SELECT statement of the view must all be structured in the same way, and the view must adhere to the following restrictions:


All the columns from the member tables should be included in the view definition.


Columns with the same ordinal position in the SELECT list should have the same data type.


The same column cannot be used multiple times in the SELECT list.


A partitioning column that segments the data must be identified and needs to have the same ordinal position across all the member table SELECT statements.


The partitioning column cannot be a computed column, an identity, a default, or a time stamp.


The data values in the partitioning column cannot overlap in the underlying tables.


The partitioning column must be part of the primary key of the member table.


The member tables in the partitioned view need a CHECK constraint on the partitioning column.


A table can appear only once as part of the UNION ALL statement.


The member tables cannot have indexes created on computed columns in the table.


The number of columns in the member table primary key constraints should be the same.


All member tables should have the same ANSI PADDING setting when created.



The list of restrictions for creating partitioned views is extensive, but the creation of a partitioned view is relatively straightforward and intuitive. Consider, for example, the Sales.SalesOrderHeader table in the Adventureworks2008database. This table is relatively small, but it is the type of table that could have a large number of rows and experience heavy utilization. To balance the workload against this table, you could use a partitioned view that utilizes base tables that each contain a separate year’s data. Listing 1 shows the CREATE TABLE statements to create the base tables for each year. The yearly tables are intended to hold summarized daily numbers, and each contains only a subset of the columns in the Sales.SalesOrderHeader table.


Listing 1. Creating the Base Tables for a Partitioned View

CREATE TABLE Sales.Sales_2001
(
    OrderDay datetime NOT NULL
        CHECK (OrderDay BETWEEN '20010101' AND '20011231'),
    SubTotal money NOT NULL ,
    TaxAmt money not null,
    Freight money not null,
 CONSTRAINT PK_Sales_2001_OrderDay PRIMARY KEY CLUSTERED (OrderDay ASC)
)

CREATE TABLE Sales.Sales_2002
(
    OrderDay datetime NOT NULL,
        CHECK (OrderDay BETWEEN '20020101' AND '20021231'),
    SubTotal money NOT NULL ,
    TaxAmt money not null,
    Freight money not null,
 CONSTRAINT PK_Sales_2002_OrderDay PRIMARY KEY CLUSTERED (OrderDay ASC)
)

CREATE TABLE Sales.Sales_2003
(
    OrderDay datetime NOT NULL
        CHECK (OrderDay BETWEEN '20030101' AND '20031231'),
    SubTotal money NOT NULL ,
    TaxAmt money not null,
    Freight money not null,
 CONSTRAINT PK_Sales_2003_OrderDay PRIMARY KEY CLUSTERED (OrderDay ASC)
)

CREATE TABLE Sales.Sales_2004
(
    OrderDay datetime NOT NULL
        CHECK (OrderDay BETWEEN '20040101' AND '20041231'),
    SubTotal money NOT NULL ,
    TaxAmt money not null,
    Freight money not null,
 CONSTRAINT PK_Sales_2004_OrderDay PRIMARY KEY CLUSTERED (OrderDay ASC)
)

Notice that each table has a primary key on OrderDay, the partitioning column. Also notice that a CHECK constraint is defined for each table; it ensures that only orders for the given year can be stored in the table.



To demonstrate the power of a partitioned view, it is best to populate the base tables that will be used by the view.Listing 2 contains a series of INSERT statements that select from the Sales.SalesOrderHeader table and populate the base tables. The SELECT statements summarize several key columns by day and contain a WHERE clause that limits the result to orders for the respective years.


Listing 2. Populating the Base Tables for a Partitioned View
INSERT Sales.Sales_2001
   SELECT CONVERT(VARCHAR(8),OrderDate,112),
      SUM(SubTotal), SUM(TaxAmt), SUM(Freight)
    FROM Sales.SalesOrderHeader
    WHERE OrderDate between '20010101' AND '20011231'
    GROUP BY CONVERT(VARCHAR(8),OrderDate,112)
INSERT Sales.Sales_2002
   SELECT CONVERT(VARCHAR(8),OrderDate,112),
      SUM(SubTotal), SUM(TaxAmt), SUM(Freight)
    FROM Sales.SalesOrderHeader
    WHERE OrderDate between '20020102' AND '20021231'
    GROUP BY CONVERT(VARCHAR(8),OrderDate,112)

INSERT Sales.Sales_2003
   SELECT CONVERT(VARCHAR(8),OrderDate,112),
      SUM(SubTotal), SUM(TaxAmt), SUM(Freight)
    FROM Sales.SalesOrderHeader
    WHERE OrderDate between '20030101' AND '20031231'
    GROUP BY CONVERT(VARCHAR(8),OrderDate,112)

INSERT Sales.Sales_2004
   SELECT CONVERT(VARCHAR(8),OrderDate,112),
      SUM(SubTotal), SUM(TaxAmt), SUM(Freight)
    FROM Sales.SalesOrderHeader
    WHERE OrderDate between '20040102' AND '20041231'
    GROUP BY CONVERT(VARCHAR(8),OrderDate,112)



Now that you have the populated base table, you can create a partitioned view and ensure that the view is selecting only from the base tables that it needs.

Two types of partitioned views are discussed in this article: local and distributed. A local partitioned view utilizes base tables found on the same server. A distributed partitioned view contains at least one base table that resides on a different (remote) server. The focus in the section is on local partitioned views. The T-SQL for creating a local partitioned view named Sales.vw_Sales_Daily is shown in Listing 3.


Listing 3. Creating a Local Partitioned View
Create View Sales.vw_Sales_Daily
 as
      SELECT * FROM Sales.Sales_2001
       UNION ALL
      SELECT * FROM Sales.Sales_2002
       UNION ALL
      SELECT * FROM Sales.Sales_2003
       UNION ALL
      SELECT * FROM Sales.Sales_2004


The best way to validate that a partitioned view is working properly is to run a conditional SELECT against the view and display the execution plan. If the partitioned view is functioning properly, it should be accessing only the base tables it needs to satisfy the SELECT and should not access all the tables in the view unless it needs to. The following example shows a sample SELECT against the new partitioned view:




SELECT * FROM Sales.vw_Sales_Daily
 WHERE OrderDay > '20040701'
   and SubTotal > 2000




If you execute this statement and review the actual execution plan, you see that an index seek is performed against the Sales.Sales_2004 table. This is the correct result, given that the SELECT statement is targeting order data from 2004.


Modifying Data Through a Partitioned View


You can modify data via a partitioned view if the SQL statement performing the modification meets certain conditions, as described here:


All columns in the partitioned view must be specified in the INSERT statement. Columns that include a DEFAULT constraint or allow nulls are also subject to this requirement.


The DEFAULT keyword cannot be used on inserts to partitioned views or on updates to partitioned views.


UPDATE statements cannot modify PRIMARY KEY columns if the member tables have text, ntext, or image columns.


Inserts and updates to a partitioned view are not allowed if the view contains a time stamp.


Identity columns in a partitioned view cannot be modified by an INSERT or UPDATE statement.


INSERT, UPDATE, and DELETE statements are not allowed against a partitioned view if there is a self-join with the same view or with any of the member tables in the statement.




Note

Data can be modified through partitioned views only in the Enterprise and Developer Editions of SQL Server 2008.


In addition to the conditions shown in this list, you must also satisfy any restrictions that apply to the member tables. Check constraints, foreign key constraints, and any other table-level restrictions must be accounted for in the modification statement. The user executing the modification against the partitioned view must have the appropriate INSERT, UPDATE, or DELETE permissions on the member tables for the update to succeed.


Distributed Partitioned Views


Microsoft provides distributed partitioned views (DPVs) as a primary means to scale out a database server. Scalability allows an application or a database to utilize additional resources, which allows it to perform more work. There are two kinds of scalability: scaleup and scaleout. A scaleup solution focuses on a single server scaled to provide more processing power than its predecessor. An example of scaleup would be migrating from a server with a single dual-core processor to a machine with 4-quad-core processor. Scaleout solutions include the addition of servers to augment the overall processing power.

DPVs are similar to local partitioned views, but they utilize one or more tables located on a remote server. The placement of partitioned data on remote servers allows the processing power of more than one server to be utilized. The partitioning is intended to be transparent to the application and allow for additional partitions and servers as the application’s needs scale.

The following list outlines the basic requirements for creating a DPV:


A linked server definition is added to each member server that will contain the partitioned data. The linked server contains the connection information required to run distributed queries on another member server.


The lazy schema validation option is set to true on each of the member servers, using sp_serveroption. This option is set for performance reasons and allows the query processor to skip schema checking of remote tables if the query can be satisfied on a single member server.


A DPV is created on each member server. This DPV references the local tables in addition to the tables found on the other member servers.



Listing 3 shows SQL commands that can be used to satisfy the requirements in the preceding list. The DPV created in the last portion of the script is similar to the local partitioned view created in the previous section. The key difference in this DPV example is the inclusion of a distributed query that retrieves records for Sales.Sales_2002from a remote server. The remote server in this example is named DbSvrXP.


Listing 3. Creating a Distributed Partitioned View
Exec sp_addlinkedserver @server='dbsvrxp',
      @srvproduct='',
      @provider='MSDASQL',
      @provstr='DRIVER={SQL Server};
SERVER=dbsvrxp;UID=linklogin;PWD=pw;Initial Catalog=Adventureworks2008'
—Set the server option for improved DPV performance
exec sp_serveroption dbsvrxp, 'lazy schema validation', true

Create View Sales.vw_Sales_Daily
 as
      SELECT * FROM Sales.Sales_2001
       UNION ALL
      SELECT * FROM dbsvrxp.Adventureworks2008.Sales.Sales_2002
       UNION ALL
      SELECT * FROM Sales.Sales_2003
       UNION ALL
      SELECT * FROM Sales.Sales_2004


The DPV created in Listing 3 contains only one remote table. The example could be further expanded to have each table in the UNION clause on a different remote server. Keep in mind that the DPV CREATE statement needs to be adjusted when run on the remote server(s). The tables that are local on one server are now remote on the other server, and those that are remote can now be local.

If the DPVs are properly defined, SQL Server 2008 attempts to optimize their performance by minimizing the amount of data transferred between member servers. The query processor retrieves the CHECK constraint definitions from each member table. This allows the query processor to map the specified search arguments to the appropriate table(s). The query execution plan then accesses only the necessary tables and retrieves only the remote rows needed to complete the SQL statement.

Data can be modified through a DPV as well. Updatable DPVs, which were introduced in SQL Server 2000, are still available in SQL Server 2008. Data modifications are performed against a view, allowing true transparency. The view is accessed as if it were a base table, and the user or application is unaware of the actual location of the data. If it is configured properly, SQL Server determines via the WHERE clause specified in the update query which partition defined in the view must be updated rather than updating all tables in the join.


Note

Data can be modified through distributed partitioned views only in the Enterprise and Developer Editions of SQL Server 2008

Thursday 9 August 2012

Partitioning in SQL Server 2008

Partitioning is a feature designed to improve the performance of queries made against a very large table. It works by having more than one subset of data for the same table. All the rows are not directly stored in the table, but they are distributed in different partitions of this table. When you query the table looking for data in a single partition, or just a few, then due to the presence of these different subsets, you should receive a quicker response from the server.

There has been a long history, in SQL Server, as this feature has evolved. It first started with partitioned views. A developer had to create different tables having same schema and use these tables through a UNION operation in a view. Now after 2005, your tables and indexes can also be partitioned. This feature is further improved upon in SQL Server 2008.

SQL Server only supports one type of partitioning, which is Range Partitions. More specifically I should say 'Horizontal Range Partitions'. This the partitioning strategy in which data is partitioned based on the range that the value of a particular field falls in. The other partitioning types are reference, hash, list etc. partitions, which are not supported currently in SQL Server.

Simply saying, when we partition a table, we define on which portion of a table a particular row will be stored. Now you must be wondering what would be the criterion that a specific row would be saved in a particular partition. There are actually rules defined for ranges of data to fill a particular partition. These ranges are based on a particular column; this is called the Partition Key. It should be noted that these ranges are non-overlapping. To achieve this, a Partition Function and a Partition Scheme is defined.

There might be many questions in your mind. Let's go step by step. We start with the terminologies used in Partitioning.

PARTITION FUNCTIONS

The Partition Function is the function that defines the number of partitions. This is the first step in the implementation of partitioning for your database object. One partition function can be used to partition a number of objects. The type of partition is also specified in the partition function, which currently can only be 'RANGE'.

Based on the fact about boundary values for partitions that which partition they should belong to, we can divide partition function into two types:

Left: The first value is the maximum value of the first partition.
Right: The first value is the minimum value of the second partition.

The syntax for the creation of a partition function is as follows:

CREATE PARTITION FUNCTION partition_function_name ( input_parameter_type )
AS RANGE [ LEFT | RIGHT ]
FOR VALUES ( [ boundary_value [ ,...n ] ] ) [ ; ]

e.g.
Generally the input_parameter_type is numeric. SQL Server supports only certain input_parameter_types. The list of partition functions defined can be obtained by using the SYS.PARTITION_FUNCTIONS catalog view:

SELECT * FROM sys.partition_functions

The first thing that you would want to do is to test whether your partition function is implemented as per your desire or not. Specially, we can check if it is working on boundary values. You can check it with the special function provided: $Partition. We test the partition function MyPartitionFunc2 created by us earlier. In this SQL, we are verifying to which partition, (Partition Key = 100), would belong to.

SELECT $PARTITION.MYPARTITIONFUNC2(100) [PARTITION NUMBER]

To find out the boundary values defined by partition functions, partition_range_values catalog view is available.

SELECT * FROM SYS.PARTITION_RANGE_VALUES

Though SQL Server does not directly support List Partitioning, you can create list partitions by tricking the partition function to specify the values with the LEFT clause. After that, put a CHECK constraint on the table, so that no other values are allowed to be inserted in the table specifying Partition Key column any value other than the 'list' of values.

PARTITION SCHEME

This is the physical storage scheme that will be followed by the partition. To define scheme, different file groups are specified, which would be occupied by each partition. It must be remembered that all partitions may also be defined with only one file group.

After the definition of a partition function, a partition scheme is defined. The partition scheme just like specifying an alignment for data i.e. it specifies the specific file groups used during partitioning an object. Though it is possible to create all partitions on PRIMARY but it would be best if these different partitions are stored in a separate file groups. This gives some performance improvement even in the case of single core computers. It would be best if these file groups are on different discs on a multi core processing machine.

The syntax for creating partition schema is as follows:

CREATE PARTITION SCHEME partition_scheme_name
AS PARTITION partition_function_name
[ ALL ] TO ( { file_group_name | [ PRIMARY ] } [ ,...n ] )
[ ; ]

The list of partition schemes can be obtained by using the SYS.PARTITION_SCHEMES.

SELECT * FROM SYS.PARTITION_SCHEMES

To get the list of all data spaces SYS.DATASPACES catalog view may be used:

SELECT * FROM SYS.DATA_SPACES

PARTITIONED TABLE

After creation of a partition scheme, a table may be defined to follow that scheme. In this case the table is called PARTITIONED. A partitioned table may have a partitioned index. Partition aligned index views may also be created for this table. These index and view may be based on different partition strategy (partition function and partition scheme).

There may be question in your mind if it is possible to partition your table using multiple columns. The answer may be YES or NO. Why? No, because there is no such direct support for this in SQL Server. Yes, because you can still do that by using persisted computed column based on any number of columns you want. It must be remembered that this is still a single dimension partition.

Now you might be wondering whether your existing tables could be partitioned or not. For partitioning your existing table just drop the clustered index on your table and recreate it on the required partition scheme.

CREATE TABLE
[ database_name . [ schema_name ] . | schema_name . ] table_name
( { <column_definition> | <computed_column_definition> }
[ <table_constraint> ] [ ,...n ] )
[ ON { partition_scheme_name ( partition_column_name ) | filegroup
| "default" } ]
[ { TEXTIMAGE_ON { filegroup | "default" } ] [ ; ]

SQL Server 2008 has introduced an exciting new feature, 'FileStream'. Remember if you have any column specified which is based on filestream data, you have to follow some additional steps to get things to work with partitioning.

PARTITIONED INDEX

We can also partition our indexes, and this should contribute to improved performance. If indexes are partitioned they serve as the local indexes for each partition. We can also go with Global indexes on a partitioned table without caring about the different partitions of the table.

It must be remembered that indexes can be partitioned using a different partition key than the table. The index can also have different numbers of partitions than the table. We cannot, however, partition the clustered index differently from the table. To partition an index, ON clause is used, specifying the partition scheme along with the column when creating the index:

CREATE NONCLUSTERED INDEX MyNonClusteredIndex ON dbo.MyTable(MyID1)
ON MyPartitionScheme(MyID2);

You can see that Mytable is indexed on the MyID1 column, but the index is partitioned on the MyID2 column. If your table and index use the same Partition function then they are called Aligned. If they go further and also use the same partition scheme as well, then they are called Storage Aligned (note this in the figure below). If you use the same partition function for partitioning index as used by the table, then generally performance is improved.

PARTITION ALIGNED INDEX VIEWS

Partition aligned index views allow to efficiently create and manage summary aggregates in relational data. The query results are materialized immediately and persisted in physical storage in the database. This is an extension of Indexed views which existed in SQL Server 2005. Earlier, it was difficult to manage index views on partitioned table. This is because switching in and out the partition was not possible as the data was not distributed in a partitioned way in the indexed view. Indexed views were required to be dropped before this operation. In SQL Server 2008, this became possible with the introduction of Partition Aligned Index Views. In this way Indexed views have now evolved to become 'Partition Aware'.

It is said that these types of index views increase the speed and efficiency of queries on the partitioned data. The following conditions should be true if an index view has to be partition aligned with the table on which this view is defined.

The partition functions of the indexes of the indexed view and table must define the same number of partitions, their boundary values and the partition must be based on the same column.
The projection list of the view definition includes the partitioning column (as opposed to an expression that includes the partitioning column) of the partitioned table.
Where the view definition performs a grouping, the partitioning column is one of the grouping columns included in the view definition.
Where the view references several tables (using joins, sub queries, functions, and so on), the indexed view is partition-aligned with only one of the partitioned tables.

These views can be local or distributed. The local partitioned index view is the one in which all partitions lie on the same SQL Server instance. For a distributed one, different partitions of tables, queried in single view, reside on different SQL Server instances across the network. The distributed partitioned views are specially used to support federation of SQL Server instances.

CREATE A PRACTICE DATABASE

Before looking at an example implementation of partitioning, let us create a new database to test this feature. Let us create a database with two file groups FG1 and FG2. Make sure that you create PartitionPractice folder in C directory before running the following query:

USE master
GO
CREATE DATABASE MyPartitionPracticeDB
ON PRIMARY
( NAME = db_dat,
FILENAME = 'c:\PartitionPractice\db.mdf',
SIZE = 4MB),
FILEGROUP FG1
( NAME = FG1_dat,
FILENAME = 'c:\PartitionPractice\FG1.ndf',
SIZE = 2MB),
FILEGROUP FG2
( NAME = FG2_dat,
FILENAME = 'c:\PartitionPractice\FG2.ndf',
SIZE = 2MB)
LOG ON
( NAME = db_log,
FILENAME = 'c:\PartitionPractice\log.ndf',
SIZE = 2MB,
FILEGROWTH = 10% );
GO
USE MyPartitionPracticeDB
GO
1. STEP # 01 (Create partition function)

CREATE PARTITION FUNCTION partfunc (int) AS
RANGE LEFT FOR VALUES (1000, 2000, 3000, 4000, 5000);
2. STEP # 02 (Create partition scheme)

CREATE PARTITION SCHEME MyPartitionScheme AS
PARTITION MyPartfunc TO
([FG1], [FG2])
GO

3. STEP # 03 (Create table or index based on the partition scheme)

The important clause related to partitioning is ON clause in CREATE TABLE statement. We create our table MyPartitionedTable as follows:

CREATE TABLE MyPartionedTable
(
ID int PRIMARY KEY,
Name VARCHAR(50)
)
ON MyPartitionScheme(ID)

GUI SUPPORT FOR PARTITIONING

Great GUI support is provided in SQL Server 2008 for partitioning. If you see the properties of a table, you can easily find partition related properties under the Storage tab.

There is no support of partitioning at the time of table creation, but later you can use the wizard for partitioning. There is a new partitioning wizard introduced in SQL Server 2008 Management Studio.

CREATE PARTITIONS

After the creation of a non-partitioned table, it can then be partitioned. To ensure the ease of partitioning, GUI support is available in SQL Server Management Studio. Not only can you create partitions in a non-partitioned table but you can also manage partitions in an already partitioned table.

Just right click a non-partitioned table and look at the following pop-up menu i.e. the Create Partition option under Storage menu:

When you select this option, an easy to use wizard is started to allow you to create partitions. Remember, you can also use already existing partition functions and scheme to partition a table. You can also create new ones and the wizard allows you to enter the details for the partition function, it allows the LEFT or RIGHT options, and for entering the details of the partition scheme. It also asks for the details of file groups for new partitions.

This wizard is very easy to use and allows the developer to generate partition functions, partition schemes and other options if none of the already existing ones fits the user's needs. There are also some third party solutions available to manage partitioning of your objects outside SQL Server Management Studio.

ADDITIONAL OPERATIONS WITH PARTITIONS

SQL Server also supports other operations with partitions. These operations help in managing a partitioned object. They are as follows:

1. Split
2. Merge
3. Switch

Remember that these operations are meta data operations and do not involve any movement of data.

ADDING A NEW PARTITION

This operation is supported to accommodate the continuous changes when a new Partition needs to be added if the already existing partitions are very large. This is technically called a Partition Split. ALTER TABLE statements support partition split with the required options. To split a partition, the ALTER PARTITION FUNCTION statement is used.

ALTER PARTITION FUNCTION MyPartitionFunction()
SPLIT RANGE (5000)

But before splitting, a new file group must be created and added to the partition scheme using the partition function being modified, otherwise, this statement would cause an error while executing.

REMOVING AN EXISTING PARTITION

The Merge operation is used to remove an existing partition. The syntax of MERGE statement is nearly the same as the SPLIT statement. This statement would remove the partition created in SPLIT command shown before this.

ALTER PARTITION FUNCTION MyPartitionFunction ()
MERGE RANGE (5000)

SWITCHING A PARTITION

This operation is used to switch a partition in or out of a partitioned table. It must be remembered that for switching in, the already existing partition must be empty within the destination table. In the following example, we are switching in a table MyNewPartition as 4th partition to MyPartitionedTable.

ALTER TABLE MyNewPartTable switch TO MyPartitionedTable PARTITION

In the following example we are switching partition 3 out to a table MyPartTable:

ALTER TABLE MyPartitionedTable switch PARTITION 4 TO MyNewPartTable

We partitioned our table because of large amount of data. In order to speed up data retrieval, we have incorporated several indexes on our table. This would certainly help in satisfying the query requirements of users, but it adds additional time while updating or adding records to the table given the large amount of data in the table. So it is better that we insert our records in an empty table, which is exactly same as the partitioned table. Insert new records in that table and then switch that partition into the partitioned table. We discuss this further as we will be discussing the Sliding Window Technique.

The requirement for the table when switching in and out of a partition means that both of them must also have same clustered and non-clustered indexes, and additionally, the same constraints.

PARTITIONING AND PARALLEL EXECUTION:

If you have multiple processing cores on your server then partitioning your large tables could also result in more optimized parallel execution plans by the database engine. It must be remembered that in SQL Server 2005, a single thread was created per partition (at max). So if there are less number of partitions then the number of processors, all the cores were not optimally utilized and some processors used to be sitting idle. With SQL Server 2008, this restriction has been removed. Now all processors can be used to satisfy the query requirement.

When SQL Server uses parallel execution plan for partition tables, you can see Parallelism operator in the execution plan of the query. In actual there are multiple threads working on different partitions of the table. Each partition is processed by a single thread. We can control parallel plan of our queries by using MAXDOP hint as part of query or plan guide. First we see how we could control the degree of parallelism using query hints:

SELECT * FROM MyPartitionedTable
ORDER BY OrderDate DESC
OPTION (MAXDOP 4)

The plan can also be created to support parallelism:

EXEC sp_create_plan_guide
@name = N'Guide1',
@stmt = N'SELECT * FROM MyPartitionedTable
ORDER BY OrderDate DESC',
@type = N'SQL',
@module_or_batch = NULL,
@params = NULL,
@hints = N'OPTION (MAXDOP 4)';

Remember if you are creating a custom plan guide for your query then you must not have specified (Maximum Degree Of Parallelism) MAXDOP = 1 as an option. Additionally the configuration option of maximum degree of parallelism for the server should be appropriately set.

sp_configure 'show advanced options', 1;
GO
RECONFIGURE WITH OVERRIDE;
GO
sp_configure 'max degree of parallelism', 0;
GO
RECONFIGURE WITH OVERRIDE;
GO

You can also set the configuration settings of the instance from the Properties window of the instance selected from the object explorer in the SQL Server Management Studio.

MULTITHREADED PARTITION ACCESS

Threading support for partitions has also been improved in SQL Server 2008. Earlier in 2005 one thread was assigned to each partition. So if there are e.g. 40 threads available and table has only 2 partitions then only 2 threads may be utilized for accessing these partitions' data. In 2008, all the partitions are accessed by all the available threads. These threads access table partitions in a round robin fashion.

You need to make sure that the 'max worker thread' configuration is properly set for your SQL Server instance.

SELECT * FROM SYS.CONFIGURATIONS
WHERE NAME = 'max worker threads'

This feature is also important in the case of non-partitioned table. This is because back in 2005 if we had large amounts of data in a non-partitioned table then only one thread was available to it. Now in 2008 all the available thread could utilize it which results in boosting the performance of data access.

Many operations can be done in parallel due to partitioning like loading data, backup and recovery and query processing.

SLIDING WINDOW TECHNIQUE

SQL Server cannot have an unlimited number of partitions. The limitation is 1000 partitions in 2005. Based on this limitation we cannot switch in partitions forever. So, how to resolve this? The Sliding Window Technique is the answer.

According to this technique, a table always has some specified number of partitions. As the new partition comes, the oldest partition is switched out of the table and archived. This process can goes on forever. Additionally, this should be very fast to do since it is a metadata operation.

This technique could give a drastic boost in performance based on the fact that so long as data is being loaded into the source staging table, your partitioned table is not locked at all. You can even improve that using bulk inserts. After it is loaded the switching in process is basically a metadata operation.

This process is so regular that you can automate this using dynamic SQL. The tasks to be followed are as follows:

Managing staging tables for switching in and out the partitions.
Switching new partition in and old partition out of the table.
Archive staging table in which data is archived and remove both staging tables.