Recently I worked for a company I will call P which had a requirement to take 500+ gigabytes of unorganized, chaotic data, put some organization to it, store it on a RAID, and then provide access to it. The goal was to process this amount of data in as few hours as possible, but while the software, based on a certain design, was undergoing early use, the performance was terrible.
This document outlines the factors affecting the performance, but it isn't a discussion about software techniques. It isn't a discourse on how to use certain languages or hardware to enhance performance, or how to design a system correctly up front. No, this document describes matters which really seem to go unseen in companies these days. I am going to focus on many other matters, some of which will perhaps touch on design and code, but which ultimately will provide information helpful to the reader for future projects similar to this one.
I will also focus heavily on a particular library I will call X, as it was the interface library to the HDF library developed by NCSA, which would be the final storage format for the data. Library X then became the focal point of performance for everyone involved, as it was the primary call interface to and from the data on the RAID.
In order to adequately analyze performance issues a controlled environment was necessary. This is due to the existence of many layers of software needed to accomplish the desired results, as well as the hardware. Each layer had to be evaluated as to its effects on overall performance. This would ultimately have led to the enhancement or replacement of any given layer or component as necessary to achieve the desired results.
At company P a controlled environment was NEVER provided. This likely had the greatest effect on the end results, both in actual software and hardware performance and in the time consumed to obtain the results. At company P requests to a computing support group led to completely unsatisfactory results. Refer to the Computing and Network support groups section for further elaboration on this.
A controlled environment is used in most areas of scientific and technical endeavour. It is essential.
Probably the software layer under the greatest suspicion for the performance problems seen at company P was the Network File System (NFS) and RAID control software. The use of these layers provided a very chaotic circumstance which affected every operation performed, not just those the project software needed.
These layers were CONSTANTLY under administration and/or being worked on or rebuilt. There were times when one would issue an ls or cp (of a small trivial file) unix command across a mount point serviced by these layers and would have to wait upwards of 20 minutes for the command to complete. Core dumps across the service would result in the necessity to reboot the executing machine.
Due to the lack of a controlled environment this layer could never be isolated out of the testing scenario, and it was therefore the layer most suspected of actually affecting performance.
It should also be noted that NFS is not recommended for the purpose it served in the project at company P, i.e., rapidly moving large amounts of data across it.
At company P library X was developed, which was called by key processes to store the data and retrieve it. Library X in turn used the HDF library by NCSA. It was designed to be a serial process which interfaced to the HDF library to generate files stored in HDF format. Clients of library X were well into the development stage when development on the ultimate implementation started, even to the point of making calls to the library. This of course is not a recommended order of development. There was considerable resistance to any changes from those developing the clients of the library. This was tested by way of changing the existing interface without doing much to change the actual use and design. This resistance therefore became a strong criterion for the implementation of the library.
Library X should have been designed, coded, tested and verified to fulfill the performance requirements BEFORE any clients were ever written to the library. It is highly likely that the design criteria of this library would have affected the design criteria for its clients. For example, if the only way to achieve performance goals were to make this library a multi process solution, the clients of the library would in turn have needed to be mindful of the multi process scenario under which they operated.
As a result of the order in which components were built, library X became constrained on both ends from an appropriate design and evolution. It was constrained by clients which had already predetermined their expectations of its use, and by the use of the HDF library.
The effects of the constrained design of library X could never be hurdled despite efforts via email forums and meeting discussions.
When I first started at company P, the project was under a management criterion of meeting a prototype scenario. This was a scenario whereby the flow of the system could be demonstrated from beginning to end on specific data available at that time. This goal was to be met by December 31, 2005.
Though work on library X and its clients was under way when I started, a rewrite of library X was started and very rapidly developed to meet the goal of this scenario. Starting from scratch on the library was necessary since an initial developer of the library kept saying they were confused by it, and the code would not build. Further, the usual use of HDF proved to be incomprehensible, so a design around a higher level interface to HDF was to be used.
The goals of this prototype scenario undermined all the steps necessary to build library X appropriately. No education on the HDF library was possible. Library X became the learning ground for a very involved HDF library. No performance evaluations could be performed given the time constraints. A predetermination to use HDF had already taken place.
The data itself could not be analyzed and understood in order to leverage its structure, frequency or any other attributes to benefit the design of library X. Not until a year later was it understood that, by way of understanding the data, it is possible to dramatically improve performance. However, understanding the data is in and of itself very complex. Specific data types were only a buzz word and nothing existed to prepare for what was to come.
Though it would seem that this scenario was merely a known prototyping phase, it actually turned out that there was an expectation and general idea that it was an actual accomplishment of completion. This is far from the truth as most of the code was, and STILL IS, merely framed code with many possible points of failure. It would appear that project management determined that if valid data could flow through a process then it could be called complete (by way of the "it's working" mindset). This of course is not an advised mindset.
The effects of this prototype scenario then weighed very heavily in the performance results. Library X should have been re-visited from scratch once again taking appropriate steps to achieve success of the library once the prototype scenario had been accomplished. This would as well then be true of library X clients.
Library X was designed to use the Hierarchical Data Format (HDF) library from the National Center for Supercomputing Applications (NCSA) at the University of Illinois. This was a decision made way too early in the project. The purpose, capabilities, etc. of the library can be researched online starting with the web site devoted to the library, http://www.hdfgroup.org/.
After evaluating the normal API for the library it was decided to stick with a recommended higher level API. There was an ongoing consensus that the high level API would meet the performance needs of the team. This was repeated and reinforced by the HDF team as well as project management. Later determinations would at least begin to show that perhaps this wasn't entirely true. However, this would be too late to have any effect.
The normal HDF API is extremely difficult to use. It is very fine grained, meaning that for any particular operation, many calls need to be made. As well, for even slightly different variations on operations, there would likely be variations on which calls were made, how many calls, parameter lists, etc.
To further complicate matters, the HDF library is very heavily affected by a whole variety of configurable entities. These include cache size, chunk factors, flushing parameters, number of records per operation, etc.
All complexities considered, the library could only be used effectively by someone with a great deal of experience with it. This is especially true for something like what this project was trying to do, which tended to push the library to its limits. No one on this project had the experience necessary to use the HDF library effectively. The effects of this then weighed very heavily on the performance results.
It should be noted that a number of recommendations towards alternatives were rendered via email or meetings. The only effort followed through was the use of a btree library using the same design. Alternative approaches and designs should have been undertaken as early as early to mid 2005.
It should also be noted that many tests were run on the HDF library, tweaking parameters as advised by the HDF team, which ultimately led to desirable results. A 600 gig HDF file was created from a 500 gig input file distributed over 7500 datasets in less than 4 hours. However, these results were acquired using a small test program with the library X / HDF libraries, and were only the results of writing. Combined with the actual clients reading and writing, it would appear that the resource usage or the other factors outlined here gravely affected the performance of the desired operations.
Without a thorough understanding of the HDF library it was necessary to communicate with the HDF team for advice in an attempt to use their library to meet the project needs. Ultimately, of course, this failed, at least in its timeliness.
The primary interface to the team was really just a liaison who carried our questions to the lead developers. This is hardly a good mechanism for getting polished results. Many times it would seem we were merely QA testing the latest release of HDF. We would be instructed to make this change and that tweak, rebuild and see what happens.
Also, the team is funded by grants and supporters of their library. It would appear that without financial backing the team's support is minimal. This could be by design.
All design criteria for the prototype scenario centered around the legacy data format, with only hints about what would be coming from the newer system data formats. The injection of newer system data into the prototype system without re-design steps would prove very problematic, especially to library X, the data storage format, and general understanding.
Library X used binary trees to distribute dual data entity combinations in memory. The keys for data a and data b were integers. Further, the storage method and retrieval method for legacy data were straightforward. The new system data introduced keys which turned out to be variable length strings. As well, data was stored based on a variety of involved bit masks which were extremely difficult to understand.
Ultimately, hacks made to library X by many team members in reaction to the injection of new system data had detrimental effects. One such effect is that each key in what was a data b node (an integer) became a variable length string. This variable length string then necessitated string operations which were very time consuming in contrast to similar operations performed more quickly on integers. Further, processing of the data based on the bit masks and other new system data criteria had to be pushed up to the clients, which resulted in many more calls to library X being necessary. This also affected performance, by many factors.
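To illustrate the kind of comparison that should have been measured before changing the key type, here is a minimal sketch in Python. It is not the code of library X, which is not shown in this document. The key format, the key count, and the names build_keys and lookup_all are all invented for illustration, and a sorted list with binary search stands in for the binary trees. Actual timings depend entirely on the machine and the real key shapes, so no result is claimed here.

```python
import bisect
import timeit


def build_keys(n: int) -> tuple[list[int], list[str]]:
    """Build hypothetical keys: plain integers, and strings with a long shared prefix."""
    int_keys = list(range(n))
    # Zero padding keeps the strings in the same sorted order as the integers.
    str_keys = [f"new-system/entity/{i:012d}" for i in range(n)]
    return int_keys, str_keys


def lookup_all(sorted_keys, probes) -> int:
    """Binary-search every probe in sorted_keys and count how many are found."""
    found = 0
    for probe in probes:
        i = bisect.bisect_left(sorted_keys, probe)
        if i < len(sorted_keys) and sorted_keys[i] == probe:
            found += 1
    return found


int_keys, str_keys = build_keys(50_000)

# Same number of lookups, same search structure; only the key type differs.
t_int = timeit.timeit(lambda: lookup_all(int_keys, int_keys), number=3)
t_str = timeit.timeit(lambda: lookup_all(str_keys, str_keys), number=3)
print(f"integer keys: {t_int:.3f}s   string keys: {t_str:.3f}s")
```

A measurement of this sort, run on a quiet machine with keys shaped like the real new system data, would have put a number on the cost of the string keys before the change was hacked into the library.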
At one point in the fulfillment of requests there was actually a network boundary in between two components. A modification was performed to eliminate this boundary where each component called library X. Though this seemed to suffice, it is still a very good indication that re-design was a necessity for the new system data.
Many issues of performance were cited by people running their ingests or requests on development/test machines. There are many other processes on these machines which tend to affect the performance of any given process.
Further, each ingest or request would consume huge resources. So, if more than one of these operations were going on there would be contention invoking process swapping and resource distribution constraints.
Once again, a controlled environment would have been necessary to eliminate the questions about the effects of shared environments such as those seen in development and testing.
In early testing of the HDF library, while trying to write 200,000 datasets, a testing process ran into the 3 gig process limit for 32 bit machines. Eventually a maximum of around 7500 datasets for 32 bit machines was established. However, the requirements predicted 200,000 or more datasets.
Eventually testing moved to a 64 bit machine. This resulted in tests that were able to push up to 400,000 datasets and 14 gig of process memory usage (which could go higher if more memory was in the machine). It would seem that 64 bit, at least for the serial design, would be necessary to meet projected requirements.
However, if a 64 bit design was necessary it certainly should have been determined at the beginning of the project. It is likely that a lot of code in the entire project would need to be changed to build and operate on 64 bit machines. If nothing else, intensive testing would need to be done, something which is lacking even in the 32 bit environment.
As well, the serial design still exists and affects performance. It is possible that 64 bit machines are not necessary after all, since parallel execution may be what is needed to achieve the desired results.
At company P the computing and network support groups, which we can call CNS, handle this kind of function. CNS is the means by which network, hardware and software requirements are provided. It is set up in such a way that a request is entered via a web page and results are rendered on a priority basis. Communication on the request is done via email and via an online forum attached to the request in web based software.
When the necessity to request a controlled environment through this team arose, it was decided to test this process, as previous experience had produced low quality results and the importance of a controlled environment was very high. Five CNS requests were submitted, one of which was for the controlled environment: a machine with the capabilities to serve this project's goals but without any other software or hardware encumbrances. All 5 requests failed to be fulfilled. The other requests were much more minor in importance but would have proved useful nonetheless.
One thing of note is that the CNS teams are actually quite unaware of the needs of the project teams. They are quite separate from the project management, and as well, somewhat indifferent about those needs. Ultimately the interface mechanism as well as the purpose of this team proved inadequate.
Based on the project management method, the overall work-to-result is divided into tasks. Each task is then given an estimated (huge guess) amount of time. The tasks are then distributed to each developer.
It would appear that not a single task really has adequate time for a complete development cycle. Perhaps the biggest lack is the time it takes for a developer to actually unit test. For every 1000 lines of code generated a developer should write at least as many lines of test code to ensure the proper functioning of the code, depending on complexity. Most of this code should be intended to actually break the end result, not validate it based on a perfect scenario. As well, a lot of this testing code should validate performance.
This project focused mostly on the perfect scenario, and primarily on flow. Performance focus was actually a reaction to failed performance, and at a late point in a code base it is very difficult to re-tool (re-factor) the code for enhanced performance. As well, there was a negative reaction to the words "re-factor" from project management, which ultimately results in an absolute need for enough time for developers to perform testing up front.
First and foremost it should be understood that QA testing should only be used as a validation to the testing which a developer should do. Of course, it is necessary for the developer to have enough and spend enough time testing.
This project suffered from the inclusion of QA testing in the development cycle, which should never be the case. The developer gets just enough time to frame some code, make sure it compiles, run a simple test or so, and then it winds up getting QA'd. Then, of course, bugs happen, which involves a whole process of tool interaction, meetings, etc., wherein time is consumed and wasted, all because QA shouldn't be in the loop in the first place. Further, the developer is now in a reactionary state to the code. Creativity and fluidity are lost.
Also, as it pertains to performance, there were no real stated criteria beyond just a hope of achieving a run in so many hours. Therefore, there isn't really any way for QA OR developers to be effective in finding problems in any given component based on testing. I.e., how many records a second must the HDF library actually write to a disk to achieve performance goals?
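A stated criterion of this kind is simple arithmetic once a total size, a time budget and a record size are written down. The sketch below, in Python, uses the figures from the HDF write test mentioned earlier (a 600 gig file in 4 hours) as the budget. The 1000 byte record size and the name required_throughput are assumptions made purely for illustration; the real record sizes are not given in this document.

```python
def required_throughput(total_bytes: int, hours: float, record_bytes: int) -> tuple[float, float]:
    """Return (bytes per second, records per second) needed to finish within `hours`."""
    seconds = hours * 3600
    bytes_per_sec = total_bytes / seconds
    return bytes_per_sec, bytes_per_sec / record_bytes


GB = 10**9  # decimal gigabyte, assumed here for simplicity

# 600 gig written in 4 hours, with a hypothetical 1000 byte record.
bps, rps = required_throughput(600 * GB, 4, 1000)
print(f"{bps / 1e6:.1f} MB/s, {rps:,.0f} records/s")
# prints "41.7 MB/s, 41,667 records/s"
```

With a number like this in hand, a developer or QA can test a single component against it, instead of waiting for a full run to come in over the hoped-for hours.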
Performance tools can help developers solve performance issues in many cases, but not necessarily in all cases. A good tool may show that a process is spending 75% of its time in a particular procedure or function. This only suggests, however, that this is a likely candidate for performance enhancement. It could be that the function itself is perfect and that, based on design, 75% of process time should be in this function. I.e., there isn't any reason to think that any given process should somehow have some kind of even distribution of time across all the functions or procedures which make up that process.
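As a small illustration of this point, the sketch below uses Python's standard cProfile and pstats modules, not VTune or any tool from the project. The functions write_records, parse_args and main are invented stand-ins. The profile will show nearly all of the time inside write_records, and that is correct by design, because writing records is the only real work this process does.

```python
import cProfile
import io
import pstats


def write_records(n: int) -> int:
    """Stand-in for the real work: by design, nearly all time belongs here."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def parse_args() -> int:
    """Stand-in for trivial setup work."""
    return 200_000


def main() -> int:
    return write_records(parse_args())


profiler = cProfile.Profile()
profiler.enable()
result = main()
profiler.disable()

# Report the top entries sorted by cumulative time.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```

The report only says where the time went. Deciding whether that is too much time still requires a stated target for the function in question.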
Effective use of a profiling tool requires experience with that tool. Minimally, it requires the tool to be properly installed and usable. The proper installation of a profiling tool, or any other type of tool for development purposes, should take place at the beginning of a project, not later in a project in a reactionary condition. These types of tools work better for some circumstances, worse for others, and therefore require up front analysis and testing before the beginning of the project in order to determine the best candidates for the software requirements.
Any tools used in this project for performance testing were purchased and installed as a reaction to performance issues. The setup of the tools was performed by CNS rather than developers (see sudo necessity). It is therefore unknown whether the tools were installed correctly. There was no chance to verify that the tools used were actually accurate in their determinations. The most promising tool, VTune from Intel, used a Java based user interface which collapsed under performance testing. Though at least one developer used the command line of this tool effectively, the validity of those results, given our overall experience with this tool, is questionable. It was a great sale for Intel however.
Though performance tools should be a benefit in software development, in the case of this project they may have actually proved detrimental.
Though initial setup of a controlled environment can be performed by a team like CNS, it is essential to the evolution of the software being developed for the machine to be fully accessed and administered by developers. This requires sudo access on unix/linux machines. Tools like VTune usually require kernel tweaks that are essential. As well, there are other tools that can be installed that require build processing, experimentation, build again scenarios. This kind of development requirement was completely undermined for this project by the insistence that CNS should be the only administrators of development machines.
This is one of the best reasons why the controlled environment should itself be the developer's workstation. This is not to suggest a developer should have a quad or eight processor machine with 40 gig of memory. Things can be accomplished in scale. This is to say that the same OS setup and software installed on the workstation would suffice as a controlled environment for a much higher end production machine, and that the speed, memory and disk capabilities for a single user workstation would be adequate.
Software development requires a concentrated mental effort. For some, like myself, it is also an expression of personal creativity. The resulting goals of software, including performance, can then be affected by one's ability to concentrate and be creative. Ergo, I have listed some matters which affect concentration and creativity.
Microsoft Windows
The software of this project was a linux based result. At least some of the developers on this project had been developing on unix/linux for some number of years, and as well, using linux as a client. For those who use linux as a client, windows presents a cumbersome, problematic tool in the development cycle. It becomes a distraction from concentration and creativity with negative effects on the end result, the code. Further, the use of tools like spreadsheets, word documents, visio presentations, etc., is also very distracting and cumbersome. Linux should definitely be offered and facilitated for use as a client workstation by any company doing unix/linux development.
Synergy
Many highly qualified developers on numerous projects at company P have spoken up about the detriments of this tool. It is the worst software tool I have ever used, and it most certainly had major effects on concentration and creativity. Given my interaction with other developers, and the help I provided them, I must state that I am not just speaking subjectively here.
The use of HDF and other external libraries did not integrate with Synergy easily, as they use standard open source development methodologies and tools like the GNU suite make/autoconf/automake. Therefore, parameters for objectmake (the Synergy make tool) do not necessarily correspond to parameters determined in builds of external tools. This issue hasn't even received any evaluation. Also, there aren't any integration components for development environments like emacs and vim for Synergy.
For evaluating performance issues one could not just run on the test machines where code was being generated by the Synergy tool. There were too many other processes being run on those machines as stated earlier. One had to run tests on a separate machine. It was not possible to move the codebase from the control of Synergy to these other machines. One could only build a very limited subset, say library X, HDF and a test tool. These were only possible because they were built to be separate from the Synergy based project in the first place.
Dispersed development environment
If one were using linux on an adequately endowed machine one should never have to do more than update a single cvs/subversion repository on a development server once components one is working on have been adequately tested. There may be cases where integration testing would have to be done on the development servers, but this would be minimal.
For this project one had to connect from an MS windows machine to an HP machine where Synergy was residing, compile code which was then dispersed via ssh to numerous linux servers, each of which may be slightly different from the others, and then one had to log into those machines to run tests. For debugging most people used ddd networked from the linux machine of choice across the HP machine to the MS windows X windows emulator. WOW, what a mouthful. I proposed this team add a Macintosh on which everyone could run their editor and a Sun machine for the debugger, just to round out the suite of the distributed development environment.
Also, most documentation was done using windows office tools and then planted out on some windows share somewhere inconsistent with any other document produced yesterday.
In the cases of doing performance testing on isolated 64 bit machines it was necessary to log into some linux machines only by way of another linux machine.
Certainly just in reading this one should be able to understand its effects on concentration and creativity.
Facilities
I first worked for about 6 months in a quad cubicle scenario. This proved to be very noisy and distracting. Isolated cubicles aren't too much better, but at least there is a slight buffering of noise and distraction. A small office works best because one can close the door, put up a no disturb sign and concentrate.
The second 6 month period I sat at an open desk in a hallway with other people. The area I sat in was constantly subject to noise of all sorts. It was absolutely impossible to concentrate.
Multi statuses
At the worst, developers on this project were required to give 7 statuses per week: stand up statuses every day at 9 AM, a project manager status on Wednesday, and a team lead status on Thursday.
Worse, the status for the project manager and team lead were emailed documents. Worse than that they were of different formats, one being RTF based with one layout and the other being MS DOC based with a different layout.
Even at 4 statuses a week, 2 stand ups and the 2 documents, that was still a lot of distraction from concentration on development.
Rigid process
The mechanical process followed by this project, wherein a developer is given a relatively short task to complete, doesn't suit creative people well. It may work well for those who see development in this way, i.e., as mechanical or assembly line oriented.
As a creative developer it is much more appropriate to work on something that constitutes a creative scenario. Also, some developers were meant to work on something that takes a whole year or two, not just a week or two.
Working on little tasks is a very difficult thing for a creative person to concentrate on. Each developer should be considered individually as to what type of assignment they should receive.
There were a number of people who had been promoted into leadership roles who were either just out of internship or had low software development experience. It's great to have zeal and want to make more money or climb the ladder, but promotion of low tenure into leadership has detrimental effects on software results, including performance.
This project used what I call "connect the dots" methodology. It starts with the desired product, tries to identify all the "tasks" to get to the goal, puts the tasks into a Microsoft Project spreadsheet, guesses how long each task will take, and then assigns the tasks out. Certainly there are more things involved. I.e., tasks can be worked on in parallel and those things should be assigned accordingly, certain things should be defined as risks (unknowns), etc.
The end result of the connect the dots method is always the same, a frame of the picture. One can tell what it should be or what it looks like, but really, it's far from done as the painting of the picture still needs to be done. And that takes much more time than the dots did.
There are many failings of connect the dots which affect the performance of software.
In a time critical project it is essential that multiple people/teams be used in parallel to develop the EXACT SAME THING. WHY? Because this has a higher chance of fulfilling the end result. It covers the unknown better. That which we know is just about non existent over against that which we do not know. It covers the problem of differences in developers. No two developers are alike of course, and some will develop some things better than others. It offers more design potentials. It covers the imperfections of the first time scenario.
Need for experimentation and analysis
Any software project needs to have a lot of analysis and experimentation to achieve the best results. The idea that anyone given a task will perform that task flawlessly, hitting the best result with minimal understanding of all the necessary information, is flawed.
As can be seen, there are many factors which affect software quality that have nothing really to do with the actual code being written. This is just the tip of the iceberg of considerations.