What Is Netezza / PureData for Analytics
Hey kids! Do you want to run structured relational queries against a massively parallel database with custom FPGA hardware!? Sure you do, so come with me on a magical adventure! OK, so I have a more detailed description of the Netezza architecture on a site where I also offer my services as a Netezza consultant. I’d like to seed the database section of this website with an article, and this is simply the easiest and most interesting topic I already have material on. It certainly has a relevant place within the database and analytics landscape.
You can read my description at the link above to get more details on the architecture. To sum it up here, though: the system is quite a lot of hardware, thrown in parallel at the problem of handling the large volumes of data often found in data warehouses. The system is designed to perform very quickly and to do so fairly easily, provided DBAs and developers learn and apply best practices for building data models and a data warehouse, and take advantage of the technology that Netezza provides.
The platform is great for more advanced analytics because, in reality, much of the prep work for these things is good old-fashioned grunt work, such as cleaning and preparing data. A large, massively parallel machine with specialized hardware makes these huge tasks a breeze. It also lets you offload much of the work that might be done slowly elsewhere. For example, instead of bringing data out of the warehouse to prep and then model in SAS, Netezza (NZ) could prep the data inside itself, do the aggregation or more advanced computational work, and hand SAS only what is necessary to take the next step. This is a significant way to reduce network and disk input/output (I/O) operations.
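To make the "prep it on the box, ship only what is needed" idea concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module purely as a stand-in database so the example runs anywhere; it is not a Netezza connection, and the sales table and its columns are hypothetical. The point is only the difference in how many rows leave the database under each approach.

```python
import sqlite3

# sqlite3 is only a stand-in so this runs anywhere; it is NOT Netezza.
conn = sqlite3.connect(":memory:")

# Hypothetical table and data, invented for illustration.
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
rows = [
    ("east", 10.0),
    ("east", 15.0),
    ("west", 7.5),
    ("west", 2.5),
    ("north", 4.0),
]
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

# Approach 1: pull every row out of the database and aggregate on the client.
all_rows = conn.execute("SELECT region, amount FROM sales").fetchall()
client_totals = {}
for region, amount in all_rows:
    client_totals[region] = client_totals.get(region, 0.0) + amount

# Approach 2: aggregate inside the database and transfer only the summary.
pushed_down = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"
).fetchall()
db_totals = dict(pushed_down)

# Same answer either way; what changes is how much data had to move.
rows_moved_client_side = len(all_rows)
rows_moved_pushdown = len(pushed_down)
print(rows_moved_client_side, rows_moved_pushdown)
```

With five rows the difference is trivial, but the gap between "every detail row" and "one row per group" is exactly the network and disk I/O that in-database prep avoids at warehouse scale.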
The above is a 2-rack model, with 480 active hard drives, 224 FPGA cores, and 280 CPU cores. It has 96 TB of raw capacity, and all data on Netezza is compressed; compression actually makes the system faster, since the FPGAs decompress the data in nearly real time. Actual data capacity is about 3-4x this raw number, although the workload you ask of the system will likely have exhausted its performance capacity by the time you’d actually want to store that much data on it.
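As a quick back-of-the-envelope check on those numbers, the sketch below multiplies the 96 TB raw figure by the 3-4x range quoted above. The function and constant names are my own, and the real ratio will depend on how compressible your data is.

```python
# Figures quoted in the text for the 2-rack model.
RAW_CAPACITY_TB = 96
COMPRESSION_RANGE = (3, 4)  # the "about 3-4x" rule of thumb from the text


def effective_capacity_tb(raw_tb, compression_ratio):
    """Estimate user-data capacity as raw capacity times compression ratio."""
    return raw_tb * compression_ratio


low_tb, high_tb = (
    effective_capacity_tb(RAW_CAPACITY_TB, ratio) for ratio in COMPRESSION_RANGE
)
print(f"Estimated capacity: {low_tb}-{high_tb} TB")
```

So roughly 288 to 384 TB of user data, under the assumption that the rule of thumb holds for your workload.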
Not only that, but it has some more advanced analytic capabilities built in as well. It has a geospatial package for working with locations and geometries, and its analytic toolkit offers a lot of math functions, with algorithms such as linear regression and k-means clustering available in the database. Additionally, other languages and packages can be used on the box; custom functions can be written in many languages, such as Java, C++, Python, and R, among others. So all in all, it’s an “old” database in that it’s a traditional, structured relational database, but it can certainly still kick it with today’s big data buzzwords and tech. Likely much of the data you’d find interesting is already on it, plus it has custom hardware, massive scan speeds, and the flexibility to do a lot of analytic work actually on the machine.
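To give a feel for why something like linear regression fits naturally inside a database, here is a small sketch of a simple (one-variable) least-squares fit where the database computes only a handful of sums and the client finishes the arithmetic. This is not the Netezza analytic toolkit's interface; sqlite3 again stands in for the database, and the points table is hypothetical.

```python
import sqlite3

# sqlite3 is only a stand-in so this runs anywhere; it is NOT Netezza.
conn = sqlite3.connect(":memory:")

# Hypothetical data lying exactly on the line y = 2x + 1.
conn.execute("CREATE TABLE points (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO points VALUES (?, ?)",
    [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)],
)

# The database scans the table once and returns five aggregates.
n, sum_x, sum_y, sum_xy, sum_xx = conn.execute(
    "SELECT COUNT(*), SUM(x), SUM(y), SUM(x * y), SUM(x * x) FROM points"
).fetchone()

# Closed-form least-squares slope and intercept from those aggregates.
slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x * sum_x)
intercept = (sum_y - slope * sum_x) / n
print(slope, intercept)
```

Only five numbers cross the wire no matter how many rows the table holds, which is the kind of work a parallel scan engine is well suited to.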