High Performance Direct SQL Reporting on SharePoint

by Jes Kirkup, 10 January 2013 08:15

Let's face it: SharePoint information is business information, and nearly all of it is stored in lists of one sort or another. As part of the wider corporate data landscape there are many reasons why you might want to read and display this list data, not only inside but also outside of SharePoint. The proper way to do this, we are told, is through web services (lists.asmx) or the object model API. Unfortunately these methods are unwieldy at best and certainly not intuitive to business decision makers who just want to get at the information themselves to analyse. After all, these people know their way around reporting tools and understand their own data. There are also performance penalties to using these supported approaches.

The lists web service will choke and die if too much data is requested, so you might also need to implement custom paging, reading a chunk at a time if it is a large and/or wide list. Nice. Wouldn't it be a lot nicer if you could generate reports directly from SharePoint's content database with standard, user-friendly and robust SQL reporting tools such as Reporting Services? Wouldn't it be even better if you had the possibility of near real-time reporting, with data from several, possibly large, lists, combined with data directly from other data sources such as the corporate CRM, HR and ERP systems?

Not possible, you say? Read on then ... Right, at this point I suspect that you may be thinking of a couple of rather large obstacles to achieving this goal:

  • The SharePoint database schema is indecipherable for most users. (Also, you are being very naughty even to be looking at it!)
  • Reading directly from the content database is explicitly not supported. (Which is why thou shalt not read the DB schema)

We will examine each of these obstacles in turn.

1) The SharePoint database schema is indecipherable for most users

When you look into the content database schema you will find that it is actually quite small in terms of the number of tables. Since nearly everything in SharePoint is a list, it turns out that all list data, and so most of the content, is stored inside a single table (actually a view) called UserData. The trouble with this is that the fields are all named generically: nvarchar1, nvarchar2 ... ntext1, ntext2 ... and so on. To make this data meaningful to end users we need to map the meaningful list field names to the generic database field names somehow.

Digging a little deeper you will find that all of this mapping information is stored in a column called tp_Fields in the Lists table. Basically it is just a blob of XML containing the mappings, preceded by a short string containing the service release of the database. A typical value looks like this:

<FieldRef Name="ContentTypeId"> </FieldRef>
<FieldRef Name="Title" ColName="nvarchar1"> </FieldRef>
<FieldRef Name="_ModerationComments" ColName="ntext1"> </FieldRef>
<FieldRef Name="File_Type" ColName="nvarchar2"> </FieldRef>
<Field ID="{ad334fed-5c36-4833-b327-2fc7238829c6}" Name="Owner"
       Description="The owner of the agenda item."
       SourceID="{6bb8807d-4d8a-4ff2-a513-d339e6657b36}"
       StaticName="Owner" Group="RNIB Columns" Type="Text"
       DisplayName="Owner"
       Version="1" Required="FALSE" Hidden="FALSE" Customization="" ReadOnly="FALSE"
       ColName="nvarchar3" RowOrdinal="0"> </Field>
<Field ID="{52578fc3-1f01-4f4d-b016-94ccbcf428cf}" Name="_Comments"
       SourceID="http://schemas.microsoft.com/sharepoint/v3"
       StaticName="_Comments" Group="Core Document Columns" Type="Note"
       DisplayName="Notes" Sortable="FALSE"
       Description="A summary of this resource" Sealed="TRUE"
       UnlimitedLengthInDocumentLibrary="TRUE" AllowDeletion="TRUE"
       ShowInFileDlg="FALSE" Customization="" ColName="ntext2"
       RowOrdinal="0"> </Field>
<Field RowOrdinal="0" Type="Choice" Format="Dropdown"
       FillInChoice="FALSE" Sealed="FALSE"
       Name="ContentType" ColName="tp_ContentType"
       SourceID="http://schemas.microsoft.com/sharepoint/v3"
       ID="{c042a256-787d-4a6f-8a8a-cf6ab767f12d}"
       DisplayName="Content Type" StaticName="ContentType" Group="_Hidden"
       PITarget="MicrosoftWindowsSharePointServices"
       PIAttribute="ContentTypeID">
  <Default>RNIBAgendaItem</Default>
  <CHOICES>
    <CHOICE>RNIBAgendaItem</CHOICE>
    <CHOICE>Folder</CHOICE>
  </CHOICES>
</Field>

If you examine the schema in a nice XML editor (e.g. Stylus Studio) you will see that it is quite understandable. We have a set of Field or FieldRef elements giving the logical name and database column name, plus additional metadata further describing the list field type. In fact, all of the information required to construct a SQL view for the list is present. The listId is needed in the view's SQL WHERE clause to filter the results to the correct list. This is the most common case but by no means the only possibility; for example, the content type could be used in conjunction with the webUrl to support aggregation-style querying across lists.
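Before any of this metadata can be processed, the XML first has to be separated from the service-release string that precedes it in tp_Fields. A minimal sketch in Python (the sample value, including the version number, is hypothetical):

```python
def extract_fields_xml(raw: str) -> str:
    """Return only the XML portion of a raw tp_Fields value by
    dropping everything before the first '<'."""
    start = raw.find("<")
    if start == -1:
        raise ValueError("no XML found in tp_Fields value")
    return raw[start:]

# Hypothetical raw value: a version string followed by the field XML.
raw_value = '12.0.0.6219<FieldRef Name="Title" ColName="nvarchar1" />'
print(extract_fields_xml(raw_value))
```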

Now, clearly, you wouldn't want to code any of this by hand. That would be time consuming, tedious and error prone, right? Well, since the metadata is already in XML format, doesn't it make sense to use XSLT to do the job? It is actually very easy: this is nothing more than code generation using XSLT templates, with the existing list XML metadata as the source. As I had already written a code generator that does this based upon SQL schema metadata, I figured it should be quite straightforward to extend it to use list metadata.

Aside from making the appropriate query to obtain the tp_Fields XML, the only things I needed to do were to wrap the XML in a top-level <data> element, passing in a default (list URL-based) view name and the list GUID as attributes, and to write a simple XSL stylesheet to spit out the T-SQL view creation script. Here is a simple template that generates a SELECT, which can be used to create SQL queries and execute them directly from the code generator:

<?xml version='1.0'?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" />
  <xsl:variable name="viewName" select="//data/@viewName" />
  <xsl:variable name="listId" select="//data/@listId" />
  <xsl:template match="/">
    <xsl:text> SELECT </xsl:text>
    <xsl:apply-templates />
    <xsl:text>FROM AllUserData WHERE tp_ListId='</xsl:text>
    <xsl:value-of select="$listId"/>
    <xsl:text>'</xsl:text>
  </xsl:template>
  <xsl:template match="/data/Field">
    <xsl:if test="position()=1">
      <xsl:text>[</xsl:text>
      <xsl:value-of select="@ColName" />
      <xsl:text>]</xsl:text>
      <xsl:text> as </xsl:text>
      <xsl:text>[</xsl:text>
      <xsl:value-of select="@DisplayName" />
      <xsl:text>]</xsl:text>
    </xsl:if>
    <xsl:if test="position()!=1 and not(@Type='Computed')">
      <xsl:text> ,</xsl:text>
      <xsl:text>[</xsl:text>
      <xsl:value-of select="@ColName" />
      <xsl:text>]</xsl:text>
      <xsl:text> as </xsl:text>
      <xsl:text>[</xsl:text>
      <xsl:value-of select="@DisplayName" />
      <xsl:text>]</xsl:text>
    </xsl:if>
  </xsl:template>
  <xsl:template match="/data/FieldRef">
    <xsl:if test="position()=1">
      <xsl:choose>
        <xsl:when test="not(@ColName)">
          <xsl:text>[tp_ContentTypeId]</xsl:text>
          <xsl:text> as </xsl:text>
          <xsl:text>[</xsl:text>
          <xsl:value-of select="@Name" />
          <xsl:text>]</xsl:text>
        </xsl:when>
        <xsl:otherwise>
          <xsl:text>[</xsl:text>
          <xsl:value-of select="@ColName" />
          <xsl:text>]</xsl:text>
          <xsl:text> as </xsl:text>
          <xsl:text>[</xsl:text>
          <xsl:value-of select="@Name" />
          <xsl:text>]</xsl:text>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:if>
    <xsl:if test="position()!=1">
      <xsl:choose>
        <xsl:when test="not(@ColName)">
          <xsl:text> ,</xsl:text>
          <xsl:text>[</xsl:text>
          <xsl:value-of select="@Name" />
          <xsl:text>]</xsl:text>
          <xsl:text> as </xsl:text>
          <xsl:text>[</xsl:text>
          <xsl:value-of select="@Name" />
          <xsl:text>]</xsl:text>
        </xsl:when>
        <xsl:otherwise>
          <xsl:text> ,</xsl:text>
          <xsl:text>[</xsl:text>
          <xsl:value-of select="@ColName" />
          <xsl:text>]</xsl:text>
          <xsl:text> as </xsl:text>
          <xsl:text>[</xsl:text>
          <xsl:value-of select="@Name" />
          <xsl:text>]</xsl:text>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
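If you would rather not involve an XSLT processor, the same mapping logic is easy to sketch with Python's standard library. The element and attribute names below follow the tp_Fields schema shown earlier, while the `<data>` wrapper and its `listId` attribute are the ones I added; this is a simplified mirror of the stylesheet, not a drop-in replacement:

```python
import xml.etree.ElementTree as ET

def build_select(data_xml: str) -> str:
    """Generate a SELECT statement from the wrapped field metadata.
    Simplification of the XSLT: FieldRefs without a ColName fall back
    to tp_ContentTypeId, and Computed fields are skipped."""
    root = ET.fromstring(data_xml)
    columns = []
    for el in root:
        if el.tag == "FieldRef":
            col = el.get("ColName") or "tp_ContentTypeId"
            columns.append(f"[{col}] as [{el.get('Name')}]")
        elif el.tag == "Field" and el.get("Type") != "Computed":
            columns.append(f"[{el.get('ColName')}] as [{el.get('DisplayName')}]")
    return (f"SELECT {' ,'.join(columns)} "
            f"FROM AllUserData WHERE tp_ListId='{root.get('listId')}'")

sample = """<data listId="bbdc3fe5-8d79-48d8-b6ac-083e4293a274">
  <FieldRef Name="ContentTypeId" />
  <FieldRef Name="Title" ColName="nvarchar1" />
  <Field Type="Text" DisplayName="Owner" ColName="nvarchar3" />
</data>"""
print(build_select(sample))
```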

It is really easy to display the results by grabbing the query result set, shoving it into a dataset and binding it to a grid control. Here I decided to use the UltraWinGrid from Infragistics as it's capable of handling most OLE DB data types, which the out-of-the-box grid is not. As you can see, this gives you a pretty neat way of browsing the content of your lists.

Taking a more complex example, here is the generated T-SQL script for a view based upon a publishing page definition:

CREATE VIEW vw_aboutus_aboutsightloss_famous
AS SELECT [tp_ContentTypeId] as [ContentTypeId] ,
[tp_ModerationStatus] as [Approval Status] ,
[ntext1] as [Approver Comments] ,
[tp_LeafName] as [tp_LeafName] ,
[nvarchar1] as [Modified_By] ,
[nvarchar2] as [Created_By] ,
[nvarchar3] as [File_Type] ,
[nvarchar4] as [HTML_File_Type] ,
[nvarchar5] as [_SourceUrl] ,
[nvarchar6] as [_SharedFileIndex] ,
[nvarchar7] as [Title] ,
[nvarchar8] as [TemplateUrl] ,
[nvarchar9] as [xd_ProgID] ,
[bit1] as [xd_Signature] ,
[ntext2] as [Description] ,
[datetime1] as [Scheduling Start Date] ,
[datetime2] as [Scheduling End Date] ,
[int1] as [Contact] ,
[nvarchar10] as [Contact E-Mail Address] ,
[nvarchar11] as [Contact Name] ,
[nvarchar12] as [Contact Picture] ,
[nvarchar14] as [Page Layout] ,
[nvarchar16] as [Variation Group ID] ,
[nvarchar17] as [Variation Relationship Link] ,
[ntext3] as [Rollup Image] ,
[tp_ContentType] as [Content Type] ,
[ntext5] as [Page Image] ,
[ntext6] as [Page Content] ,
[ntext7] as [Summary Links] ,
[nvarchar19] as [Byline] ,
[datetime3] as [Article Date] ,
[ntext8] as [Image Caption] ,
[ntext9] as [Style Definitions] ,
[ntext10] as [Summary Links 2] ,
[ntext11] as [Keywords] ,
[nvarchar20] as [Promo title] ,
[nvarchar21] as [Main content] ,
[nvarchar22] as [Description] ,
[nvarchar23] as [Page title] ,
[nvarchar24] as [Page image caption] ,
[bit2] as [H2 anchor list] ,
[nvarchar25] as [Anchor list content] ,
[nvarchar26] as [Public contact] ,
[ntext12] as [Publishing content] ,
[ntext13] as [H2 links header] ,
[ntext14] as [Page image] ,
[nvarchar27] as [Content co-ordinator] ,
[nvarchar28] as [Content Author] ,
[datetime4] as [Date modified] ,
[nvarchar29] as [Copyright information] ,
[ntext15] as [Description]
FROM AllUserData WHERE tp_ListId='bbdc3fe5-8d79-48d8-b6ac-083e4293a274'

The nice thing about using SQL views in this fashion is that they abstract us away from the underlying schema details. As far as any reporting tool is concerned, it is just reading data from tables. This allows us to build a semantic model which can also be used for ad hoc reporting using business-focussed terminology. As I will explain later, it is possible to place all of these views in a separate report database which points to the "real" data. So if Microsoft come along and decide to change the schema for the lists, it will very likely entail only changes to the XSL mapper stylesheet, which is just a text file.

The entire set of view schemas could then be regenerated very rapidly with minimal effort. Furthermore, a reporting schema (such as a Report Builder SMDL-based model) and any reports built from it would remain completely unaffected. The other benefit of compiling SQL queries as views is that they can be secured, and since they are part of the database they are accessible, reusable, and get backed up with database backups. So that's the first obstacle addressed: we have a robust, repeatable, template-based method for generating SQL views that can be used to build a reporting "semantic layer".

2) Reading directly from the content database is explicitly not supported

Oh dear, so all bets are off then! Yep, on that second point, according to KB841057 even reading from the content database is prohibited. So watch out, folks: if you simply open and browse a SharePoint database in SQL Server Management Studio you have technically just put your system into an unsupported state. OK, so obviously writing data back through SQL or changing the database schema isn't too smart, but simply reading data?

The argument goes like this: a SQL SELECT operation may result in "unexpected locks" occurring, which could cause SharePoint to "malfunction". Of course, you might argue that you could just use the NOLOCK table hint to prevent any such locking issue. However, doing so causes the SQL query to operate at the lowest isolation level possible (dirty read). This means that uncommitted transaction data could be returned; in fact, for a large list, the results returned when the query has finished executing may not represent the state of the list at any point in time. You may be happy to live with this limitation, but for many organisations it would be a no-no. Is there a solution? It turns out that there are several. A large part of configuring SharePoint for enterprise deployments involves setting up the SQL Server database for high availability.

Aside from failover clustering, there are two ways that this is commonly achieved, database mirroring and log shipping.

Database mirroring

Database mirroring increases database availability and can support near-instant failover by continuously sending a database's transaction log records from an originating SQL Server instance (the principal) to a destination "warm standby" instance (the mirror). The mirror database is created by restoring a database backup of the principal database with NORECOVERY, which means the mirror database itself is not accessible to clients.

Reporting from the Mirror

However, you can still take advantage of the mirror database that is being used for high availability to offload reporting. As a matter of fact, why not make the darned thing earn its keep? It's going to be doing very little most of the time otherwise! To use a mirror database for reporting, you create a database snapshot on the mirror. The snapshot isolation level was first introduced in SQL Server 2005, specifically to support transactionally consistent, point-in-time reporting. Database snapshots use the same technology, but instead of a single query context an entire virtual database is created. Snapshot databases are created quite simply using the following syntax:

CREATE DATABASE iSite_Snapshot ON (NAME = WSS_Content_ISite, FILENAME='C:\TEMP\Isite.ss') AS SNAPSHOT OF WSS_Content_ISite

It is very quick to create a new snapshot database, as the physical snapshot file will initially be empty. This is because the snapshot only stores data pages that differ between the two databases, and there will be none upon creation. It is also very quick to drop a snapshot, which of course has no effect on the mirror source database:

DROP DATABASE iSite_Snapshot

The mirror maintains the snapshot database by performing "copy on write" operations. When the mirror database commits transactions from the principal, it takes a copy of each data page before it is changed and writes it to the snapshot. In this way the snapshot only stores the data that has changed on the mirror; all other data pages are shared with the mirror.

Unlike the mirror database itself, a database snapshot is accessible to clients. As long as the mirror is communicating with the principal, reporting clients can connect to a snapshot. As a database snapshot is static, to make current data available a new database snapshot must be taken periodically and incoming client connections directed to the newest snapshot. See http://msdn.microsoft.com/en-us/library/ms175511.aspx. So, as with most things, there are some real-world tradeoffs to take into consideration:

  1. Snapshot performance implications, especially if there are a number of active snapshots.
  2. Managing the refreshing of snapshots and keeping track of which is the latest. This wouldn't be an issue if there were only one snapshot that was periodically dropped and recreated.
  3. Failover: on the infrequent and temporary occasions when a database failover occurs, the mirror server would be unavailable for reporting until the principal server was back online.
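Managing the snapshot refresh amounts to little more than scheduling a drop-and-recreate. Here is a sketch of a helper that builds the refresh batch; the database, logical file and path names are illustrative, and in practice this would run against the mirror via a SQL Agent job or a small script over an ODBC connection:

```python
def snapshot_refresh_batch(source_db: str, snapshot_db: str,
                           logical_name: str, snapshot_file: str) -> list[str]:
    """Build the T-SQL statements that replace an existing snapshot
    with a fresh one taken from the mirror database."""
    return [
        # Drop the old snapshot only if it exists.
        f"IF DB_ID('{snapshot_db}') IS NOT NULL DROP DATABASE {snapshot_db};",
        # Recreate it from the current state of the mirror.
        f"CREATE DATABASE {snapshot_db} "
        f"ON (NAME = {logical_name}, FILENAME = '{snapshot_file}') "
        f"AS SNAPSHOT OF {source_db};",
    ]

for stmt in snapshot_refresh_batch("WSS_Content_ISite", "iSite_Snapshot",
                                   "WSS_Content_ISite", r"C:\TEMP\ISite.ss"):
    print(stmt)
```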

Log shipping

Similar in concept to database mirroring, log shipping also operates at the database level. You can use log shipping to maintain one or more warm standby databases for a corresponding production database that is referred to as the primary database. Standby databases are also referred to as secondary databases. Each secondary database is created by restoring a database backup of the primary database with no recovery, or with standby. The standby server in a log shipping scenario can either be inaccessible or read-only. Restoring "with standby" lets you use the secondary database for reporting.

Cheap and cheerful

Finally, if you can live with the latency, how about a good old fashioned restore of last night's backup onto a separate read only instance?

What about security, versioning and approval?

Interesting question. Well, you will have observed that we can pretty much bypass security if we want to. The truth of the matter, though, is that if we are building an enterprise reporting platform we are likely to be using a number of other data sources aside from MOSS, each with their own security: typically data from ERP, CRM and payroll systems, and so on.

Indeed, this architecture allows us to do exactly that. The point is that we will very likely be operating within a much wider security context than SharePoint list permissions; many decision support/BI systems have their own security model, particularly in a data warehousing scenario. There is nothing to stop us from joining to the USERINFO table in our views:

SELECT USERDATA.NVARCHAR1 AS TITLE,
USERDATA.TP_EDITOR AS EDITORID,
LISTS.TP_TITLE AS LIST_TITLE,
USERINFO_EDITORS.TP_LOGIN AS USERLOGIN,
USERINFO_EDITORS.TP_TITLE AS EDITORNAME,
USERINFO_AUTHORS.TP_EMAIL AS AUTHOREMAIL,
USERINFO_AUTHORS.TP_TITLE AS AUTHORNAME,
WEBS.FULLURL AS SITEURL, WEBS.TITLE AS SITETITLE
FROM USERDATA INNER JOIN LISTS ON
USERDATA.TP_LISTID = LISTS.TP_ID
INNER JOIN USERINFO AS USERINFO_AUTHORS
ON USERDATA.TP_AUTHOR = USERINFO_AUTHORS.TP_ID
INNER JOIN USERINFO AS USERINFO_EDITORS
ON USERDATA.TP_EDITOR = USERINFO_EDITORS.TP_ID
INNER JOIN WEBS ON LISTS.TP_WEBID = WEBS.ID
WHERE (LISTS.TP_ID = 'LISTGUID')

The same holds true for the tables that store permissions. Let's not forget that ultimately SharePoint reads this data through SQL, so the required data is all there. There is nothing preventing you from using SQL Server Profiler to find out what SQL is actually being executed against the content DB. So theoretically we could use the SharePoint permissions structure if we really needed to; it would just be more effort. As for versioning, if a list is versioned then additional filtering is needed to ensure we are only looking at the latest version. The approval state is simply a status field on the list.

Conclusion

With templated SQL Views and database mirroring we have the potential to integrate SharePoint content directly into a low latency, high performance BI solution that is also 100% supported. If you stop and think for a while there are a whole host of other applications, aside from Reporting Services, that could be built on top of such an architecture.


Jes Kirkup

KTC Limited
