Formatting a 25M-line codebase overnight

(stripe.dev)

85 points | by r00k 2 hours ago

12 comments

CrzyLngPwd 1 hour ago
One of my first jobs was a small software company writing software for a small number of clients, in MS basic PDS.
The lead developer didn't like to bother with formatting code, so I wrote a tool called makenice to format his nasty spaghetti gibberish into something with good indents and layout to make it easier for us normal people to parse.
He was furious, literally spun in circles about it right in the office in front of everyone, so I wrote makenasty to format code into the way he appeared to like.
I only shared makenasty/nice with a couple of the team, who loved it, as it allowed easy conversion between something readable and something the team lead liked.
He never knew about makenasty.
- nitwit005 53 minutes ago
  If he didn't bother formatting code, it would seem impossible to create a tool that formatted code the way he preferred.
  - singpolyma3 48 minutes ago
    Sounds like he did format code, and even had opinions on how it should be formatted, but OP disagreed.
- munk-a 1 hour ago
  Outside of the naming - this is a perfectly sane thing to do for developer comfort and can usually be accomplished with simple transformations.
  There are often limitations (like manually added indentation/spacing for alignment) but as long as you're very intentional about what changes you'll allow and have a good understanding of the language it can be an extremely safe operation.
- Terr_ 19 minutes ago
  I find a lot of these conflicts can be resolved when everybody agrees that the pain of ugly/unnecessary diffs is greater than the pain of minor formatting disagreements.
hobofan 1 hour ago
I'm surprised they went with an all-at-once reformat. Even when doing it over a weekend this is bound to mess with a lot of open PRs at their scale.
I had to introduce a formatter in a few sizeable codebases in the past (a few 100k to a few million LOC), and I always did it incrementally via a script that reformatted all files that were not touched in any open PR. The initial run reformatted 95% of all files. Then I ran the script every day for ~two weeks and got up to 99.5% of all files, and then reformatted the rest manually each time one of the remaining ~dozen long-running WIP PRs was merged.
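For illustration, the selection step of such a script could look roughly like the sketch below. This is not the actual script. The function name `files_safe_to_format` and the input shapes are assumptions, and fetching the list of files touched by each open PR (from whatever code host is in use) is left out entirely.

```python
from typing import Iterable


def files_safe_to_format(
    all_files: Iterable[str],
    open_pr_files: Iterable[Iterable[str]],
) -> list[str]:
    """Return the files that no open PR touches, sorted for stable output.

    `all_files` is every file the formatter should eventually cover.
    `open_pr_files` holds one collection of changed paths per open PR
    (hypothetical input; how it is obtained depends on the code host).
    """
    touched: set[str] = set()
    for pr_files in open_pr_files:
        touched.update(pr_files)
    return sorted(path for path in all_files if path not in touched)


if __name__ == "__main__":
    repo = ["a.rb", "b.rb", "c.rb", "d.rb"]
    open_prs = [["b.rb"], ["b.rb", "d.rb"]]
    # Only files untouched by any open PR are reformatted in this run.
    print(files_safe_to_format(repo, open_prs))
```

Re-running this daily naturally picks up files as their PRs merge, which matches the 95% then 99.5% progression described above.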
- rileymichael 37 minutes ago
  both options have their pros and cons. if you utilize some form of ratcheting[1], you can sneak it in without your team knowing.. but all of your PRs for the foreseeable future will have a ton of reformatting screwing with your git blame. if you do it all at once, someone will have to sort out conflicts, but you can utilize `blame.ignoreRevsFile`[2] so that your history remains useful
  [1] https://github.com/diffplug/spotless/tree/main/plugin-gradle...
  [2] https://git-scm.com/docs/git-blame#Documentation/git-blame.t...
  - BobbyTables2 15 minutes ago
    That’s a neat feature, thanks for sharing.
    Unfortunately I find that code bases lacking auto formatting are often littered with non functional changes as developers temporarily instrument code, remove it, but leave whitespace changes behind.
    In terms of tracking code changes, one really would have to rewrite the entire history with each commit reformatted.
  - hobofan 29 minutes ago
    Yes, that is a good point. This is also why I personally would recommend letting a central person/team handle the reformatting rather than sneaking it into every PR (see my sibling comment). That way you can be in charge of having a uniform style of commit messages to make the reformat commits easy to identify and create a well-kept ignoreRevsFile. I think that provides the best of both worlds.
- skydhash 1 hour ago
  You can always let the team know so that they can apply the formatter on their PR branch.
  - hobofan 35 minutes ago
    In the smaller migrations I did I tried that, but some way or another a decent chunk of the people still managed to get stuck in merge/rebase conflicts. I would almost explicitly not recommend giving that advice to the teams.
    My rough blueprint for introducing a formatter or linter nowadays would be:
    - Recorded knowledge share session around how to set up the tools for local use 1-2 weeks before the initial rollout, and outline how the process will take place
    - On the day of the initial rollout send out a reminder + the recording again
    - Do the initial PR
    - Incrementally do the rest of the migration, and subscribe to the PRs that drag out the process
  - jrajav 36 minutes ago
    This is exactly the remedy to the PR issue. I've "lucked" into owning a Prettier formatting pass at two different places now, and did the same process at each - full pass on master, simple step-by-step process to follow to update any PR by running the format script.
munificent 1 hour ago
> We chose a Saturday to format the entire codebase to avoid merge conflicts. And while our test suite gave us high confidence we'd gotten everything right, it's always a bit daunting to have a diff so large that GitHub can't render it.
The dart formatter has an internal sanity check. It walks through the unformatted and formatted strings in parallel skipping any whitespace. If any non-whitespace characters don't match, it immediately aborts. This ensures that the only thing the formatter changes is whitespace, and makes it much less spooky to run it blind on a huge codebase.
That sanity check has saved my ass a couple of times when weird bugs crept in, usually around unusual combinations of language features around new syntax.
(Unfortunately, the formatter in the past year has gotten a little more flexible about the kinds of changes it makes, including sometimes moving comments relative to commas and brackets, so this sanity check skips some punctuation characters too, making it a little less reliable.)
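A minimal sketch of that kind of check, written in Python rather than Dart and not taken from the dart formatter's source: walk both strings in parallel, skip whitespace, and fail on the first mismatch. The names `non_whitespace` and `only_whitespace_changed` are made up for this example.

```python
from itertools import zip_longest
from typing import Iterator

_MISSING = object()  # marks "one side ran out of characters first"


def non_whitespace(text: str) -> Iterator[str]:
    """Yield the characters of `text`, skipping all whitespace."""
    return (ch for ch in text if not ch.isspace())


def only_whitespace_changed(before: str, after: str) -> bool:
    """True if `before` and `after` differ only in whitespace."""
    pairs = zip_longest(
        non_whitespace(before), non_whitespace(after), fillvalue=_MISSING
    )
    for a, b in pairs:
        if a != b:
            # Abort at the first non-whitespace difference.
            return False
    return True


if __name__ == "__main__":
    print(only_whitespace_changed("foo( a,b )", "foo(a, b)"))
    print(only_whitespace_changed("a + b", "a - b"))
```

Note that this simple version also ignores whitespace inside string literals, so it would not catch a formatter that mangled one.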
- Terr_ 17 minutes ago
  I imagine a fancier version would be to compare the Abstract Syntax Trees.
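As a rough illustration of the AST idea, here is a sketch using Python's standard `ast` module on Python source (the thread is about Dart and Ruby, so this only demonstrates the technique; `same_ast` is a made-up name). `ast.dump` leaves out line and column information by default, so two sources that differ only in layout produce the same dump.

```python
import ast


def same_ast(before: str, after: str) -> bool:
    """True if both sources parse to structurally identical syntax trees."""
    return ast.dump(ast.parse(before)) == ast.dump(ast.parse(after))


if __name__ == "__main__":
    # Redundant parentheses and spacing do not appear in the tree.
    print(same_ast("x = (1 +   2)", "x = 1 + 2"))
    # A changed operator does.
    print(same_ast("x = 1 + 2", "x = 1 - 2"))
```

One caveat: comments are not part of this AST, so a formatter that dropped or moved a comment would pass this check unnoticed.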
varun_ch 2 hours ago
I’m shocked at the 25M line part! That is a completely unfathomable amount of code for one codebase. I really want to know more about that.
- bruckie 1 hour ago
  Only 25 million? :) Google had billions a decade ago...
  https://research.google/pubs/why-google-stores-billions-of-l...
- deathanatos 9 minutes ago
  My (much smaller than Stripe) company is well over 4.5M at this point, and the graph is very much exponential.
  AI has been a huge problem here: the amount of code is just exploding. Quality of the produced code is another matter.
- jsnell 2 hours ago
  Right, where is the rest of the code?
- mr_mitm 2 hours ago
  They're up to 42 million now, as per the article
  - lukan 1 hour ago
    That sounds even more insane to me, but I guess most of that code does not really touch financial transactions, otherwise it would be a nightmare being responsible for verifying that.
    - clintonb 58 minutes ago
      Ruby code touches financial transactions. Card payments were migrated to Java when I left in 2022. Non-card payments (e.g., ACH, checks, various wallets) were still processed by Ruby.
      PCI-related/vaulting code lived in its own locked-down repo. I think that was a mix of Go and Ruby.
      Once you have the foundations in place for account balances and the ledger, processing a payment isn’t that daunting. Those foundations, however, took a lot to build and evolve.
nitwit005 44 minutes ago
> Given that complexity, the hypothesis was simple: tackle the hardest syntax first and the rest will follow.
Always nice to see. I've seen people fall into the trap of designing for the common case, not realizing most of the code will be to deal with the less common cases.
comrade1234 43 minutes ago
Man, must be nice to have the time to put so much work into tabs.
burnte 1 hour ago
The floating spiral thing is so distracting I spent more time deleting it in Inspector than reading the article. I feel like they hate their readers. Awful.
- annaspies 24 minutes ago
  If you set `prefers-reduced-motion: reduce`, it goes away
hokkos 1 hour ago
Now it makes me wonder, are those 45M LoC untyped?
- c3ab8ff137 1 hour ago
  No, Stripe has its own Ruby typechecker - https://sorbet.org/
- m12k 1 hour ago
  https://brandur.org/nanoglyphs/015-ruby-typing#ruby-typing
exsol 2 hours ago
[dead]
CrzyLngPwd 1 hour ago
Surely, it no longer needs to be human-readable, and the era of write-only code is finally upon us with the dawn of AI writing our mealtickets.
Why bother formatting 25m lines of slop, and why is AI wasting tokens on making code look human-readable anyway?
andrewstuart 2 hours ago
A major financial processing company writes its money-handling systems in Ruby.
Terrifying.
- mbStavola 2 hours ago
  Considering that it's been doing so successfully at volume for just over 15 years, I think their language choice was fine.
- sixo 1 hour ago
  This ought to change your mind about Ruby!
- sunrunner 1 hour ago
  Things can always be worse. It could be PHP, for example.
  - burnte 1 hour ago
    Facebook runs in it, so I think the language itself is probably a fine choice.
    - Twirrim 1 hour ago
      It's almost like other factors than language choice are more important :)
- skinfaxi 2 hours ago
  Why is that terrifying?
  - mikedelago 1 hour ago
    Some folks don't like shipping
  - Jtsummers 2 hours ago
    It's not particularly terrifying. Some people really just don't like Ruby.
  - fantasizr 1 hour ago
    ive yet to see a compelling elitist programming language opinion. especially when used at big successful companies. these companies don't function in spite of their technology choices.
    - lstodd 1 hour ago
      > these companies don't function in spite of their technology choices.
      shows you never worked at "big successful companies".
- sikozu 2 hours ago
  The systems have to be written in some kind of programming language, and I think Ruby is a perfectly fine choice.
  - Imustaskforhelp 1 hour ago
    Not denying that Ruby is a perfectly fine choice but within the article itself it says that Stripe runs the world's largest Ruby codebase so certainly it might be testing the constraints of the language.
    The thing I am interested in is that I don't suppose Stripe always had this many LOC, so I would be curious to know whether, at any point as the codebase was growing, they looked at newer languages such as Golang or Rust that might have been better suited for their work, and what their decisions/thinking process were for continuing to use Ruby.
    - throwaway041207 17 minutes ago
      Stripe uses Sorbet which, in my experience, increases LOC.
    - clintonb 55 minutes ago
      LOC doesn’t have much to do with the “constraints of the language”.
      Stripe has dabbled in Golang. There is also a growing Java monorepo.
- semiquaver 1 hour ago
  I’d hardly call Sorbet Ruby :)
- benbristow 2 hours ago
  [dead]
cadamsdotcom 1 hour ago
An insight about code is that, compared to the scale at which we operate on data, code as text is tiny. Instantaneous git operations and “run this tool over all the code” are the norm, even while we wait for LLMs to stream their tokens back so tool calls can operate on them.
That insight might seem obvious - but if you stay cognizant of it as you work, you can invent some pretty amazing tooling for yourself & your team.