Pentaho: The Solution to SSIS!
-
It’s a given that the business requirements of Extraction, Transformation, & Loading (ETL) are often broad-ranging and complex, whether it’s a trans-departmental data migration into a unified data source or the deployment of a new data-warehousing project for a large, multi-national concern, so any tool that helps to simplify the setup and execution of these processes is always welcome… provided that it actually works, of course.

As a Customer Relationship Management (CRM) consultant of over 12 years, from my days working at, and then consulting with, GoldMine & HEAT, through the debut and maturation of Microsoft CRM and Salesforce.com, and as a systems, network, and database administrator as well, I find such tools indispensable, and oftentimes fun, even if the data transformation task boils down to the one-time use of an old stalwart VB script in Excel. Most of the time, however, the process is far more involved and multifaceted than that, which is why tools like DTS played such an important part in Microsoft SQL 2000.

When Microsoft’s SQL Server Integration Services (SSIS) was introduced with MS SQL 2005, that data processing and integration capability (ostensibly) took an important step further: with a primarily graphical, drag-and-drop designer, plus a whole retinue of pre-built package workflows and data transformations, the task of meeting an ever-increasing variety of demands for data transformation and integration never looked easier… until one actually got one’s hands dirty with it, that is.

Speaking personally, while at first I really liked all the options and flexibility SSIS appeared to provide (and in more than a few ways actually does), one still cannot, for example, take an old DTS package and quickly convert it to any semblance of a working package in SSIS, as would be commonplace and expected for a solution upgrade of this magnitude. Worse, the amount of time it takes to make, and then propagate, even a minute change to an area such as a join that affects multiple transformation components downstream in the data flow is, more often than not, stupefyingly slow and onerous. It often requires one to “touch” (i.e., open the properties of) each and every pre-existing component along the way, just so that the upstream changes are applied properly to that component, even though no edits are actually made there: the changes apply automatically, but only once each component has been manually opened first. If your head is spinning at what I just described, take comfort in the fact that you didn’t misunderstand me. This makes a simple change to a package an extremely time-consuming event, and that’s just one example of how SSIS not only overcomplicates things which should be easy, but hasn’t followed through on fixing major issues such as these across the multiple release iterations it has gone through since 2005!

Other notable, long-standing oversights that Microsoft clearly appears to have no intention of addressing include its insistence upon providing a plain FTP component instead of the far more ubiquitous, security-minded alternatives such as SFTP, unless one installs a third-party component or develops one oneself.
Another major issue is the overly complicated and time-consuming delivery of a custom package onto a production machine and environment. This often takes many more hours than it should; I have yet to see a single new SSIS project deployment onto a new production environment that did not take at least several hours to implement and work the kinks out of before it was up and running. The way security authentication typically works in these package deployments is often the biggest hindrance to a quick and easy roll-out, with even a simple password-protected package deployment scenario being considerably less than intuitive, much less straightforward or easy. The same bugs have in fact been around for so long that an entire auxiliary library of technical-support-derived articles on how to “work around” one deployment issue after another will soon be de rigueur for any would-be SSIS aficionado.

This is not a comprehensive list, nor does it seek to be. These are simply a few of the many reasons I believe SSIS is no longer the go-to data integration solution it once held significant promise of becoming. For painful oversights like those touched upon above to go unheeded through at least three major product revisions since its original release speaks volumes about Microsoft’s intentions (or lack thereof) in addressing what could have been a truly tremendous product.

The straw that broke the camel’s back for me was trying to perform a simple database connection from our local MySQL database to a standard, remote instance of PostgreSQL. The term “Integration Services” really amounts to a misnomer when one is unable to make a simple data connection to one of the fastest-growing alternative SQL database solutions on the market. Even more so when the touted drivers provided for the task fail to remedy the situation, and the few freely available third-party solutions prove unequal to the task just the same. I know this isn’t the case for everyone, and I’m sure that had I dedicated a few more hours or days to the task, I’d probably have gotten at least something to (maybe) work, at least partially, the way I needed it to; but the fact remains that something this basic should never require so much time to configure correctly in the first place. The literature out there on this subject reads like Ulysses, and why should anyone spend such a disproportionate amount of time configuring what amounts to less than a hair’s-width slice of the sum total of a project’s execution anyway? Connecting to another database carries only a few very simple variables for configuration, and if the driver that exists for facilitating that connection doesn’t even work, do I really want to continue riding that horse, even if I’m able to eventually beat it back onto its stumbling and neglected feet?

Enter the freely available, open-source ETL solution in the community edition of Pentaho Data Integration (formerly known as Kettle). When I finally discovered just how intelligently designed, full-featured, easy, and competent this Java-based ETL application actually was, I think I experienced something akin to what Moses’ followers must have felt when they saw him parting the Red Sea before them. Pharaohs be damned!
I have finally been delivered unto the Promised Land!!

Now, I don’t want to exaggerate by saying Pentaho Data Integration (PDI) doesn’t have any problems whatsoever, because on occasion it has crashed; but in each case I had upwards of 10 separate data transformation design windows open simultaneously, and I’ve thus far only been working with the Windows edition, whereas the Linux version is purportedly much more stable. In any event, I’ve never been able to have that many separate package tabs open in SSIS without it crashing just the same.

After taking the time to properly configure the local workstation’s Java environment variables and paths, the setup of PDI is really very simple (but do make sure to implement the Java configuration correctly before anything else).

The first thing one notices is that, like SSIS, PDI offers a graphical designer interface for the creation of data flows. While not as aesthetically refined as the interface in MS SSIS, this is not a beauty contest, but a utilitarian one. And it becomes immediately apparent, just by looking at all the design components, transformations, and database connections available, especially once one tests them and sees how they actually work, that Pentaho not only talks integration, it walks it too!

Setting up the DB drivers for MySQL and PostgreSQL was simply a matter of downloading their respective .jar driver files and copying them into the installation directory’s \pentaho\data-integration\libext\JDBC subfolder. Then, just like that, fully functional database connectivity between the two disparate database technologies was established! (A quick smoke test for those drivers appears below.)

Pentaho also makes it incredibly easy, via its built-in wizard, to copy one table at a time, or multiple tables all at once. The system has been built so intelligently that if any tables with the same name as the one(s) being copied already exist in the destination database, Pentaho will automatically generate an ALTER TABLE script (instead of CREATE) to facilitate the data transfer upon job execution. The system is a little too smart for its own good, actually: if one wishes to copy multiple tables into a target database very easily, the wizard is certainly the way to go, but be advised that it won’t let you modify the destination table names with distinguishing prefixes or alternative names of any kind, as the wizard assumes it will create each table first and copy it over using the same name throughout. A little added granularity on that front would have been nice, but it’s still easier to make minor adjustments to a plethora of automatically generated transform operations (if need be) than to map them all out from scratch.

If possible, try to have an empty database reserved just for the target tables (if they are to be used for staging, for example) and run an auto-truncate script every x hours, depending on how frequently such an operation needs to occur; a sketch of that sort of script follows just below. Otherwise, if tables with the same names as the source’s already exist in the target database, PDI will verify this dynamically at runtime, and then it WILL alter them and append its source data to whatever remains within the altered table(s).
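To put that staging-cleanup suggestion into concrete terms, the recurring truncate doesn’t need to be anything fancy. Here’s a rough, hypothetical sketch of my own in plain Java and JDBC; the table names, connection details, and six-hour interval are all placeholders for whatever your own staging setup calls for:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class StagingTruncateJob {
    // Hypothetical staging tables reserved for the wizard's output.
    private static final String[] STAGING_TABLES = {"stg_contacts", "stg_orders"};

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Run immediately, then every 6 hours; adjust the period to the
        // frequency your staging loads actually require.
        scheduler.scheduleAtFixedRate(StagingTruncateJob::truncateAll, 0, 6, TimeUnit.HOURS);
    }

    private static void truncateAll() {
        // Placeholder host, database, and credentials; in practice these
        // belong in a config file, not hardcoded.
        try (Connection pg = DriverManager.getConnection(
                 "jdbc:postgresql://remote-host:5432/targetdb", "etl_user", "secret");
             Statement stmt = pg.createStatement()) {
            for (String table : STAGING_TABLES) {
                stmt.executeUpdate("TRUNCATE TABLE " + table);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```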
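And as promised, here’s the quick driver smoke test. Again, just a minimal sketch of my own, not anything PDI itself requires: the hostnames, ports, database names, and credentials below are placeholders, and the two driver .jar files (the same ones copied into \pentaho\data-integration\libext\JDBC) must be on the classpath. With JDBC 4 drivers, no explicit Class.forName registration is needed:

```java
import java.sql.Connection;
import java.sql.DriverManager;

public class JdbcSmokeTest {
    public static void main(String[] args) throws Exception {
        // Placeholder URL and credentials for the local MySQL source.
        try (Connection mysql = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/sourcedb", "etl_user", "secret")) {
            System.out.println("MySQL OK: "
                    + mysql.getMetaData().getDatabaseProductVersion());
        }
        // Placeholder URL and credentials for the remote PostgreSQL target.
        try (Connection pg = DriverManager.getConnection(
                "jdbc:postgresql://remote-host:5432/targetdb", "etl_user", "secret")) {
            System.out.println("PostgreSQL OK: "
                    + pg.getMetaData().getDatabaseProductVersion());
        }
    }
}
```

If both connections print a version string, the same drivers will work inside PDI’s connection dialogs.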
Of course, one can also just set up each operation manually very easily as well.

For secure file transfers, PDI comes ready-built with put and get operations for FTPS, SFTP, and SSH2, and of course the totally unsecured FTP option as well, if you want it; but who wants to use that anymore (unless, of course, you fit Microsoft’s idea of the average user as someone who apparently doesn’t care about such things as security anyhow)? Speaking of which, Pentaho also gives you the option to both encrypt and decrypt those precious files you’re securely transferring with PGP, and even to verify the signature(s) with PGP as well while you’re doing it.

There are built-in input and output components for MongoDB, JSON, Hadoop, Cassandra, Amazon EMR Job Executor, S3, Teradata, Google Analytics, SAP, Salesforce, XBase, Pentaho Reporting, and every other database and file-format variety you would hope for, and certainly expect, from an ETL suite of winning proportions like this one. There’s even an automatic documentation generator to provide simple documentation of the jobs being created and executed. Do any of these exist in, let alone come bundled with, SSIS? Considering how prominent and widely used these technologies are becoming (if they haven’t become prominent already), it’s kind of odd that they aren’t included, but then that’s Microsoft, I guess: always trying to corral everyone into their own proprietary, and shrinking, marketspace.

Need to incorporate lookup operations against REST services? You’ll have to build your own little app if you want to run one as part of an SSIS package, but PDI comes with its own REST client as part of its wide retinue of lookup operations, right out of the gate (a quick sketch of that sort of lookup appears at the end of this post). You can even run a credit card validation as part of the standard set of components that come ready-made in this freely available and robust ETL solution.

You’d think I was trying to sell it, but what is there to sell, really? This is open-source software at its finest. And the documentation available for it isn’t lacking either. If you’re looking for a worthy alternative to Microsoft’s SSIS for data integration projects of any scope and size, you really owe it to yourself to give Pentaho Data Integration a try. I’m sure glad I did, and I can’t wait to try out all of its features and be there as it grows… even as I grow further away from the platform I encountered it on, and meet it in another increasingly challenging arena for Microsoft: that of the OS variety.
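P.S. For the curious, the REST lookup I alluded to above is nothing exotic. Here’s roughly the sort of thing you’d otherwise be hand-rolling as a custom app for SSIS, sketched with a made-up endpoint standing in for a real lookup service; PDI’s REST client step gives you the equivalent as a ready-made component:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class RestLookup {
    public static void main(String[] args) throws Exception {
        // Hypothetical lookup endpoint; substitute the service your
        // transformation actually needs to enrich rows from.
        URL url = new URL("https://api.example.com/customers/12345");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("Accept", "application/json");

        // Read the response body line by line.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line);
            }
            System.out.println(conn.getResponseCode() + ": " + body);
        }
    }
}
```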