Thoughts while changing the API of a massive framework...
At the Bank we have a home-grown ETL framework that we've been using for quite some time. We recently completed a total rewrite, but unfortunately we left out a few changes. Had I gotten those changes in 5 months ago, I would have only had to break the API of about 10 modules. Today, in order to make those changes, I have to break the API of 122 modules.
What follows is an account of this ordeal, provided for entertainment value only. There will be a future post that explains some of the things I did to make this task surmountable.
- Day 1:
- 3:45pm - 122 modules left
- 4:31pm - 112 modules left - And then I remember there's another feature to add that will require yet another migration of all these modules.
- 4:52pm - 106 modules left - Test::Continuous removes 3 steps for each module. Total time saved: HOLY FUCK THAT'S AWESOME
- 5:35pm - 97 modules left - Every commit message during this ordeal is another love note to those who put off this migration five months ago, when there were only 10 modules to migrate.
- 6:09pm - 94 modules left - New API to change: Create a role to do it for me! +100 experience points! (There's a sketch of the idea further down, after the timeline.)
- 6:15pm - 93 modules left - Why unpack the hash of args passed in to the method if the method you're calling takes exactly the same arguments? `my $arg_name = $args{arg_name}; return $self->method( arg_name => $arg_name );` should never happen!
- 6:37pm - 87 modules left - A thought: If the other team using this project ultimately rejects this API change, I get to write my own brand-new ETL framework from scratch! Temptation, thy name is Zoidberg.
- 6:51pm - 84 modules left - Found a bug in the new API! Finally something interesting to do!
- 7:00pm - 80 modules left - Every time you copy/paste code in tests, God inflicts another programmer with carpal tunnel. Please think of the programmers.
- Day 2:
- 3:02pm - 80 modules left - Let's see if I remember all the macros I left in vim over the weekend... Test::Continuous is still running, which is nice
- 3:18pm - 71 modules left - The end is in sight!
- 4:30pm - 52 modules left - Perhaps I was premature...
- 6:30pm - 41 modules left - Got caught up putting out fires elsewhere. Derailed.
- Day 3:
- 1:35pm - 41 modules left - Another bug in the new API. Doing it this way is certainly shaking out the bugs.
- 2:21pm - 20 modules left - Smooth sailing at last...
- 3:47pm - 0 modules left - AND THE CROWD GOES WILD!
Total time elapsed: 3.25+3.5+2.25 = 9 hours. Not bad for 130 commits to migrate 122 modules.
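For the curious, the 6:09pm role and the 6:15pm argument-passing gripe could look roughly like the sketch below. This is a minimal illustration, not the framework itself: it assumes Moose, and every package, method, and attribute name in it (ETL::Role::NewSourceAPI, fetch_rows, source_name, extract) is made up.

```perl
package ETL::Role::NewSourceAPI;    # hypothetical name
use Moose::Role;

# Each client module supplies only the part that is unique to it;
# the role provides the new-API boilerplate once, for everyone.
requires 'fetch_rows';

# An attribute the (hypothetical) new API expects every source to have.
has 'source_name' => (
    is       => 'ro',
    isa      => 'Str',
    required => 1,
);

# New-API entry point, written here once instead of in 122 modules.
sub extract {
    my ( $self, %args ) = @_;

    # Don't unpack %args just to repack it; pass it straight through.
    return $self->fetch_rows( %args, source => $self->source_name );
}

no Moose::Role;

# A client module then just consumes the role and fills in its one piece.
package ETL::Source::ExampleFeed;   # hypothetical client module
use Moose;
with 'ETL::Role::NewSourceAPI';

sub fetch_rows {
    my ( $self, %args ) = @_;
    # ... module-specific fetching goes here ...
    return [];
}

__PACKAGE__->meta->make_immutable;
1;
```

The payoff is that the new-API glue lives in exactly one place, and %args gets handed straight through instead of being unpacked and repacked in every module.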
I can't help but think that if you need to change 120 modules when ONE API changes, you could have made some abstraction somewhere.
Still, I'm happy for you that you pulled this one off...
I love how you keep detailed timing. I have been similarly keeping time in daily.org and weekly.org files for the past several months, but not yet to this level of detail.
That is a large number of modules to maintain for ETL. We have 3-4 main modules used for our ETL process.
Can you give a breakdown of the modules and what they are used for? What database do you load into?
There is some refactoring we can do, but not as much as you'd think: these are all client modules of the framework. They're the implementations. They get/put data from/into databases (7 of them, including CSV and JSON), and they get data from various analytic libraries, news feeds, web scraping, and other such sources. Then there are the transform modules that run more analysis (using 12 functions with different calling conventions from 2 different libraries).
It's all business logic, which can be difficult to abstract.
We have two teams that each have different databases. We have three databases that we directly manage: Sybase and two commercial time-series databases. We also have CSV and JSON "database" modules so we can chop up a job or save a copy of data for later auditing.
A bunch of modules are simply sources of data: News feeds from Reuters and Bloomberg, internal feeds, data from other internal databases not managed by us.
The bulk of the modules I had to change were data transforms: fills (forward, backward, sparse), simple derivations (divide by X, multiply by X, reciprocal, etc.), rich analytics (using internal libraries), data normalization, holiday/weekend removal, and loads more. This is where the business logic goes, so a lot of these are ugly custom things that only one client really cares about.
The more pieces that are available, the easier it is to create the custom process that the client inevitably wants (we don't use our own data; someone else asks us to prepare data for them). We chain the pieces together to create the solution.
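To make "chain the pieces together" a bit more concrete, here's a toy sketch of the shape of a job. Everything in it is invented for illustration (the real pieces are full module classes with their own configuration, not bare coderefs), but the composition idea is the same: a source, a stack of transforms, and a destination, wired up per client.

```perl
use strict;
use warnings;

# Toy stand-ins for the real pieces; every name and behaviour here is
# made up. A real job composes module classes, not coderefs.
my @pipeline = (
    # "source": produce some rows (fake time-series data with gaps)
    sub { return [ 10, undef, 12, 13, undef ] },

    # "transform": forward fill (carry the last seen value forward)
    sub {
        my ($rows) = @_;
        my $last;
        return [ map { defined $_ ? ( $last = $_ ) : $last } @$rows ];
    },

    # "transform": simple derivation (multiply by X)
    sub {
        my ($rows) = @_;
        return [ map { $_ * 100 } @$rows ];
    },

    # "destination": load the result somewhere (here, just print it)
    sub {
        my ($rows) = @_;
        print join( ', ', @$rows ), "\n";    # 1000, 1000, 1200, 1300, 1300
        return $rows;
    },
);

# The driver just chains the pieces: each step gets the previous step's output.
my $data;
$data = $_->($data) for @pipeline;
```

Swapping in a different fill, derivation, or destination for a particular client is then just a matter of changing one entry in the chain.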