20100130

Query multiple datastores with MetaModel 1.2

I am currently packaging and distributing the new version of MetaModel - version 1.2. In this blog-post I'll introduce what I think is the most exciting thing introduced in this version: Composite DataContexts aka. "Query multiple datasources with a single query". Or in plain english: You can now treat multiple datastores as if they where one.

An example:
Imagine that you want to match a database table with the contents of an excel spreadsheet, you can easily create a query that reads from both datastores and does all the joining, filtering etc. that is possible with regular MetaModel queries.

DataContext database = DataContextFactory.createJdbcDataContext( myConnection );
DataContext spreadsheet = DataContextFactory.createExcelDataContext(new File("my_spreadsheet.xls");

Table dbTable = database.getDefaultSchema().getTableByName("my_db_table");
Column dbPkColumn = dbTable.getColumnByName("my_primary_key");
Table excelTable = spreadsheet.getDefaultSchema().getTableByName("my_sheet");
Column excelFkColumn = excelTable.getColumnByName("my_foreign_key");

// now we create a composite DataContext which enables us
// to explore and query both DataContexts transparently
// through the same DataContext reference!

DataContext composite = DataContextFactory.createCompositeDataContext( database, spreadsheet );

// example query with carthesian product and cross-datastore where clause
Query q = new Query();
q.from(dbTable).from(excelTable);
q.select(dbTable.getColumns())
q.select(excelTable.getColumns());
q.where(dbPkColumn, OperatorType.EQUALS_TO, excelFkColumn);

DataSet ds = composite.executeQuery(q)
// read the result

... How cool is that?

Of course if a query is posted to the composite DataContext that spans multiple underlaying DataContexts, it will most likely spawn a case of "client side joining" which will not perform well compared to co-locating the datastores. But often that is not possible (or practical if it's just a case of ad-hoc analysis) so I believe that the new composite DataContext feature can add some real value to a lot of projects!

Other notable news in MetaModel 1.2: We now support MS Access databases and dBase (.dbf) database-files.

20091130

Setting up Mondrian for JNDI DataSources, XML/A and custom CSS styles

The other day I decided that I wanted to set up mondrian as an adhoc analysis package for Lund&Bendsens intranet application, "Yacs". I didn't want to install a large application like Pentaho for just this specific need - rather I wanted to deploy just a simple cube schema, reuse the Java EE datasource definition that the intranet-application was already using and apply some basic styling to comply with the corporate profile. Doing these steps showed a lot more complex than I first imagined, primarily because I think the examples of the standalone mondrian distribution are overly complex and poorly designed, incapsulation-wise. Here is a list of steps I recommend doing to set up Mondrian "the right way":

  1. Deploy your Java EE datasource in a container/database-specific way. If you JBoss and MySQL like I do, here's an example datasource descriptor, place it in the deploy-folder with a filename like "mydatasource.xml" (bold parts should be replaced with your specific configuration):
    <datasources>
       <local-tx-datasource>
         <jndi-name>MyDataSource</jndi-name>
         <connection-url>jdbc:mysql://localhost/mydatabase</connection-url>
         <driver-class>com.mysql.jdbc.Driver</driver-class>
         <user-name>username</user-name>
         <password>password</password>
         <min-pool-size>5</min-pool-size>
         <max-pool-size>20</max-pool-size>
         <valid-connection-checker-class-name>
           org.jboss.resource.adapter.jdbc.vendor.MySQLValidConnectionChecker
         </valid-connection-checker-class-name>
         <metadata>
           <type-mapping>mySQL</type-mapping>
         </metadata>
       </local-tx-datasource>
    </datasources>
  2. Unzip the mondrian.war archive so you can edit the application.
  3. Add a container specific mapping of the DataSource to this application. In JBoss this is done by placing a file called "jboss-web.xml" in the WEB-INF folder with this content:
    <jboss-web>
       <resource-ref>
         <res-ref-name>MyDataSource</res-ref-name>
         <res-type>javax.sql.DataSource</res-type>
         <jndi-name>java:/MyDataSource</jndi-name>
       </resource-ref>
    </jboss-web>
  4. Now edit the WEB-INF/web.xml file and add the following entry inside the <web-app> element:
    <resource-ref>
       <res-ref-name>MyDataSource</res-ref-name>
       <res-type>javax.sql.DataSource</res-type>
       <res-auth>Container</res-auth>
    </resource-ref>
    </web-app>
  5. Also change the mapping of the JPivot filter so it goes like this:
    <filter-mapping>
       <filter-name>JPivotController</filter-name>
       <url-pattern>/*</url-pattern>
    </filter-mapping>
  6. Create a schema-file and save it under WEB-INF/mycatalog.xml. I won't give instructions as to writing schemas - Mondrians documentation cover this quite well.
  7. If you want to enable XML/A support, use this as a template for your WEB-INF/datasources.xml file (notice here that we use the application-local JNDI string here (including java:comp/env/...)):
    <datasources>
       <datasource>
         <datasourcename>Provider=Mondrian;DataSource=MyDataSource;</datasourcename>
         <datasourcedescription>My example datasource</datasourcedescription>
         <url>http://localhost:8888/mondrian/xmla</url>
         <datasourceinfo>Provider=mondrian;DataSource=java:comp/env/MyDataSource;
         </datasourceinfo>
         <providername>Mondrian</providername>
         <providertype>MDP</providertype>
         <authenticationmode>Unauthenticated</authenticationmode>
         <catalogs>
           <catalog name="MyCatalog"
             <definition>/WEB-INF/mycatalog.xml</definition>
           </catalog>
         </catalogs>
       </datasource>
    </datasources>
  8. The views on the cube that you now want must be created as individual JSP pages. One of the things that are really lacking in the mondrian bundle is reasonable JSP pages with less complexity and reasonable reuse of datasources. Here's how I build mine (you can more or less put this stuff into the testpage.jsp page and then you're not dependent on all the jsp include stuff) - notice now that the datasource-reference entered here is the just the last, ie. the reference in web.xml:
    <% if (session.getAttribute("query01") == null) { %>
      <jp:mondrianquery id="query01" datasource="MyDataSource" cataloguri="/WEB-INF/mydatasource.xml">
       <!-- Initial MDX query goes here -->
      </jp:mondrianquery>
    <% } %>
Horray! Now you've got connection pooling, sharing and all the other cool stuff that Java EE DataSources provide. Next step: Add styling. First: Remove all the stylesheets that come with mondrian. You don't need them because it's actually quite a lot easier to add your own than to try and modify the existing ones. Here's the result of 10 minutes of styling:



The important CSS id's and classes are:

  • To control styling of the pivot table and its cells it's important that you add <div> around this tag in the JSP:
    <wcf:render ref="query01" xslUri="/WEB-INF/jpivot/table/mdxtable.xsl" xslCache="true" />
    You can then use your div's id-attribute to target cells, headers etc. in your stylesheet.

  • The class .heading-heading: Used for headings of headings, ie. the top-level blue cells in the screenshot above.

  • The classes .column-heading-span, .column-heading-even, .column-heading-odd: Used for column headers, ie. the gray cells above the pivot table content.

  • The classes .column-heading-span, .column-heading-even, .column-heading-odd: Used for column headers, ie. the gray cells above the pivot table content.

  • The classes .row-heading-span, .row-heading-even, .row-heading-odd: Used for row headers, ie. the gray cells to the left of the pivot table content.

  • The classes .cell-even and .cell-odd: Used for the cells on even and odd rows.

I encourage the mondrian crew to clean up the reference application, but I'm guessing they are using the messy configuration to convince people to switch to a full Pentaho deployment :-)

20091109

JPA and the N+1 select problem

Warning to readers: This blog entry contains references to articles only available in Danish. So if you keep on reading be prepared to weep if you want to follow my suggestions ;-)

Lately I've been working hard on Lund&Bendsens intranet and processes involved around it. I've been using JBoss Seam for the most part and overall I'm quite thrilled about this choice of web framework. One of the cool parts about Seam is the way it integrates with Java Persistence API (JPA)/Hibernate and handles my persistence context even when I'm rendering the views for the intranet.

At the same time when I have been developing the intranet features in Seam and JPA, my colleague Kenn Sano wrote an excellent article about the N+1 Select Problem in JPA. Here's what it all comes down to (In Danish):

"Man kan [med JPA] "vandre rundt" i en objektgraf og på magisk vis hentes data, som stilles til rådighed i takt med, at vi traverserer - dvs. objekters tilstand indlæses fra databasen alt imens vi bevæger os rundt i objektgrafen. Hvis man ikke er opmærksom på, hvordan JPA fungerer, kan det resultere i mange SQL-kald mod databasen, hvilket kan have stor negativ indvirkning på performance."
I was doing this plentifully in Seam. When presenting a list of courses or a list of students, Seam lets me easily traverse the items in the list by using EL-expressions such as "#{course.location.address}" which involved several N+1 performance penalties. For instance on the list of all planned courses, including their locations, enrolled students etc. I observed a whole of N*4+1 query penalty. So there's no doubt you need to be aware of the impact of your querying strategy.

Note: I'm not blaming JBoss Seam for this behavior ... Seam makes everything a whole lot easier and when everything is easy you just tend to forget to think yourself ;-) Anyways - go read the article if you're interested in JPA and understand Danish.

20090908

New book on Open Source Business Intelligence tells the DataCleaner-story

About half a year ago we received an exciting inquiry from Jos van Dongen on behalf of him and his co-author Roland Bouman, telling us that they where writing a new book about Open Source Business Intelligence and in particular Pentaho-based solutions. And for this they where looking into DataCleaner for the data profiling section of the book!

The book is now out! It's called "Pentaho Solutions" and it's published by Wiley Publishing. You can read about it and buy it on their website as well.

The book contains a walkthrough for building a data warehouse using Open Souce tools and in doing so applying DataCleaner for the important job of profiling and validation.

We congratulate Roland Bouman and Jos van Dongen for their great work to promote Open Source Business Intelligence and thank them for mentioning DataCleaner while they're at it!

20090714

Open Source Data Quality with DataCleaner 1.5.2

Today I've announced the release of DataCleaner 1.5.2, yay! I'm pretty excited about this release as I think this is probably the biggest of the minor releases to date. And especially I hope to see that our new "single jar file" distribution-option will attract new users. Go read the announcement for more details now :)

20090628

Introducing AnalyzerBeans

It's been some time now since I first designed the core API's of the DataCleaner project and as time goes on, some of my initial assumptions about the design of profilers, validation rules and so on have shown to be less-than-optimal in regards to flexibility and scalability for the application. This is why yesterday I decided to do a major change in the roadmap for the project:

  • The idea about the "webmonitor" application (DataCleaner 2.0) have been cancelled for now. If anyone wants to realize this idea it's still something that I am very much interested in, but as you will see I have found that other priorities are more important.
  • A new project have been founded - for now as a "sandbox" project: AnalyzerBeans. AnalyzerBeans is a rethought architecture for datastore profiling, validation etc. - in one word: "Analysis". When this project is stable and mature we will probably be ready for something I like to think of as a new DataCleaner 2.0.
So why rethink datastore analysis? Because the "old way" have proven to be very cumbersome for some tasks that I did not initially realise would have importance. The current DataCleaner design assumes that all profiles, validation rules etc. do serial-processing of rows. This is not always the best way to do processing although it simplifies optimization of the execution-mechanism because all components execute in the same way and can thus share result sets etc. In AnalyzerBeans we want the best of both worlds: Flexibility to do al sorts of weird processing and rigidity for the lot of profilers which actually do process rows serially.

The solution is a new annotation based component-model. Each profiler, validation rule etc. will not have to implement certain interfaces because we can now mix and match annotations to the specific type of analysis-component - each "AnalyzerBean". There are a lot more interesting features available when we introduce an annotation-based model, but let me first give you a simple example of how a regular row-processing DataCleaner-style profile would look like:
@AnalyzerBean(name="Row counter", execution=ExecutionType.ROW_PROCESSING)
public class MySerialCounter {

    @Configured("Table to count")
    private Table table;
    private long count = 0l;

    @Run
    public void run(Row row, long count) {
        this.count += count;
    }
}
Now this is not so impressive. I've just replaced the IProfile interface of DataCleaner's API's with some annotations. But notice how I've gotten rid of the ProfileDescriptor class which was used to hold metadata about the profiler. Instead the annotations represent the class metadata. This is actually excactly what annotations are for :-) Also notice that I've gotten a type-safe configuration-property using the @Configured annotation. This means that I don't have to parse a string, ask for a Table of the corresponding name etc. And the UI will become a LOT more easy to develop because of type-safe facilities like this.

But an even more exciting way to use the new API is when creating a whole new type of profiler, an exploring AnalyzerBean:
@AnalyzerBean(name="Row counter", execution=ExecutionType.EXPLORING)
public class MySerialCounter {

    @Configured("Table to count")
    private Table table;
    private Number count;

    @Run
    public void run(DataContext dc) {
        DataSet ds = dc.executeQuery(new Query().selectCount().from(table));
        ds.next();
        this.count = (Number) row.getValue(0);
        ds.close();
    }
}
Now this is something totally new: A component that can gain total control of the DataContext and create it's own query based on some @Configured parameters. I imagine that this programming model will give us complete flexibility to do exiting new things that was impossible in the DataCleaner-framework: Join testing, non-serial Value Distribution etc.

There are a few other annotations available to the AnalyzerBean-developers but I will take a look at them in a more in-depth blog-entry later. For now - let me know if you like the ideas and if you have any comments. Anyone who would like to help out in the development of the AnalyzerBeans project should visit our wiki page on the subject.

20090613

Performance benchmark: DataCleaner thrives on lower column counts

Today I've conducted an experiment. After fixing a bug related to CSV-file-reading in DataCleaner, I was wondering how performance was impacted by different kinds of CSV file compositions. The reason that I suspected that this could impact performance is that CSV files with many columns will require a somewhat larger chunk of memory in order to keep a single row in memory compared to CSV files with fewer columns. In the older versions of DataCleaner we discovered that using 200 or more columns would actually make the application run out of memory! Fortunately, this bug is fixed, but there is still a significant performance penalty, as this blog post will hopefully show.

I auto-generated three files for the benchmark: "huge.csv" with 2.000 columns and 16.000 rows, "long.csv" with 250 columns and 128.000 rows and "slim.csv" with only 10 columns and a roaring 3.200.000 rows. Together, each file has 32.000.000 cells to be profiled. I set up a profiler job with the profiles Standard measures and String analysis on all columns.

Here are the (surprising?) results:

filename rows columns start time end timetotal time
huge.csv 16000 2000 18:54:48 19:40:2845:40
long.csv 128000 250 19:44:53 19:52:317:38
slim.csv 3200000 10 19:53:46 19:55:031:17


So the bottom line is: Lowering the number of columns has a very significant, positive impact on performance. Having a lot of columns means that you will need to hold a lot more data in memory and needless to say you will have to replace this large chunk of memory a lot of times during the execution of a large profiler job. Going all the way from 45 minutes to 1½ is quite an improvement - so don't pre-join tables or anything like that before you run them through your profiler.