
Are developers slowed down by AI? Evaluating an RCT (?) and what it tells us about developer productivity


Seven different people texted or otherwise messaged me about this study, which claims to measure “the impact of early-2025 AI on experienced open-source developer productivity.”

You know, when I decided to become a psychological scientist I never imagined that “teaching research methods so we can actually evaluate evidence about developers” could ever graduate into a near-full-time beat, but when the people call, I answer. Let’s talk about RCTs, evaluating claims made in papers versus by headlines, and causality.

The Study 

This study leads with a plot that has of course already been pulled into various press releases and taken on a life of its own on social media.

In brief, 16 open-source software developers were recruited to complete development tasks for this research project. They provided the tasks themselves, drawn from real issues and work in their own repositories (I think this is one of the most interesting features). These tasks were then randomly assigned either to the “AI” condition (for a pretty fluid and unclear definition of AI, because how AI was used was freely chosen) or to the “not AI” condition. 

There are some important design details to consider: 

  • The randomization occurred at the level of tasks, not at the level of people, which I’ve seen a number of folks confused by in the reactions to this paper. However, to me the randomization does not include standardization of treatment: the treatment can be quite variable depending on the task within the “AI” condition. This is probably why the caption says developers "can" use AI and not "using AI."
  • The tasks were not standardized between developers but were instead situated, real issues that these recruited folks were working on in their own repositories. Therefore the tasks are not directly comparable to each other, although there is some amount of context of the tasks provided. This will be very important to remember.
  • Repository, then, is also a variable in this study that travels with developer, because each developer is working in a different project context and a different issue context. Presumably modeling ‘AI/no AI’ within developer also folds in repo, since developers only worked in one repo. I just think it’s helpful to remember that the repo might include context that cuts across issues and changes the AI possibilities (e.g., a simple causal scenario: the same AI tool might fail at a task in one repo but succeed at it in another, caused by something about the structure of the repo).
  • There is a ton of developer free choice in this design. Notably developers got to choose the order they worked on tasks. They even sometimes worked on multiple issues at a time! 
  • They could also dial up or down the amount of “AI” in the treatment condition, and choose what type of tool they used. When doing an “AI”-assigned task, developers could use AI as much as they wanted, down to and including not at all.
  • All developers got tasks in both conditions. This is what we call a within-subject design. This is also one reason that we have many observations in this study despite the N being low (small number of people, relatively larger number of tasks). 
  • Note that the intervals around the slowdown effect are large and, on the negative side, come quite close to overlapping 0%. It has also been pointed out that the original analysis wasn’t designed for repeated measures, although an author has since replied to that post to say the differences are not big when they account for it, so we will take their word for this (see the sketch just after this list for the general shape of such an analysis). Either way, I will take some issue with treating this point estimate as a sweeping statement about developers’ time while working with AI. 
  • They also collect screen recordings, which aren’t much dwelled on here, and my guess is that a bigger project about them will come out in the future. They manually label 143 hours of recordings, which is impressive. That being said, it’s a bit opaque how the labeling categories were created and why those were the ones chosen. I find some of the overlap confusing in terms of categorizing the commonalities between AI and not-AI tasks in terms of the human activity involved, e.g., “reviewing AI output” could be considered a form of “reading”, and “Waiting on AI” sounds like the same thing as “Idle.” It is clear from G.8 that the labeling was done by contractors rather than the authors themselves, and it looks like the requirements were programming experience and Cursor experience, rather than behavioral science experience. 
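
To make the repeated-measures point concrete, here is a minimal sketch of the general shape such an analysis could take, on fully simulated data. None of the numbers, names, or effect sizes below come from the study; they are placeholders.

```python
# Minimal sketch of a repeated-measures style analysis on simulated data.
# Nothing here is the study's data or model; names and effect sizes are invented.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
rows = []
for dev in range(16):                       # 16 developers, as in the study
    dev_intercept = rng.normal(0, 0.3)      # developer-level random intercept
    for _ in range(rng.integers(6, 20)):    # unequal numbers of tasks per developer
        ai = int(rng.integers(0, 2))        # task-level randomization to AI / no-AI
        task_difficulty = rng.normal(1.0, 0.5)
        log_time = task_difficulty + dev_intercept + 0.18 * ai + rng.normal(0, 0.4)
        rows.append({"developer": dev, "ai": ai, "log_time": log_time})
df = pd.DataFrame(rows)

# A random intercept per developer treats tasks as nested within people; a repo
# grouping could be handled the same way, since each developer works in one repo.
fit = smf.mixedlm("log_time ~ ai", df, groups=df["developer"]).fit()
print(fit.summary())
```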

Study context: This preprint/report (I’m not entirely sure what to call it, since it’s hosted on their website) is put out by METR, which appears to be a donation-funded org doing what they describe as “evaluations of frontier AI systems’ ability to complete complex tasks without human input.” They also list some partnerships with companies, including OpenAI. I have zero connection to or knowledge of this org or any of these folks; their website seems to provide some nice transparency. I have done plenty of research work that relies on funding within the industry, including running a lab at a tech company. I think donation-based organizations doing research are providing an interesting and important service. I strongly think that software teams need more research, and that tech companies who profit so much from software work would benefit from funding research. These authors deserve credit for sharing an open-access report with rich detail, including the scripts and instructions that participants saw, as that deeply helps us evaluate their claims.  

On to the study! 

I have thoughts about this study, and they’re overlapping, so I’ve tried to roughly organize these into categories but research design is a many-headed hydra kind of thing, just like software development, where one issue frequently bleeds into others. 

Time, Forecasting, and Productivity? 

The study makes a rather big thing of the time estimates and then the actual time measure in the AI-allowed condition. Everyone is focusing on that Figure 1, but interestingly, devs’ forecasts for AI-allowed issues are just as correlated with actual completion time as their pre-estimates for the AI-disallowed condition, if not more so: .64 and .59 respectively. In other words, despite their (in retrospect) optimistic bias about AI time savings, which shifts their estimates, the AI condition doesn’t seem to have made developers any less accurate in their pre-planning at ranking the relative effort an issue will take to solve out of a pool of issues. 

I feel that this is a very strange counterpoint to the “AI drags down developers and they didn’t know it” takeaway that is being framed as the headline. If we want to know whether developers are accurate about predicting how long their work will take them when they use AI, well, it seems like the AI condition didn’t make them any worse at ranking the issues, perhaps optimistically shifted, at least for complex work like software where it’s notoriously difficult to make a time estimate. If we want to know, on the other hand, whether “devs are kind of inaccurate when looking back at a bunch of tasks they finished with a new tool and being asked if the new tool helped,” well, they kind of are. But surely the per-issue forecasting and relative sorting is a pretty important piece of someone’s overall planning, problem-solving, and execution of tasks. In fact I know it is, because these elements are core to problem-solving. And between-issue sorting could be a more tangible and specific measure of a developer's judgment. Arguably, good planning about tackling an issue is also a measure that relates to productivity.
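
As a toy illustration of why “optimistically shifted” and “bad at ranking” are different properties of a forecast, here is a tiny sketch on invented numbers; nothing in it is the study’s data.

```python
# Toy sketch: forecasts can be systematically optimistic and still preserve the
# relative ordering of issues. All numbers below are invented, not the study's.
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(1)
actual = rng.lognormal(mean=1.0, sigma=0.6, size=40)          # actual hours per issue
forecast = actual * 0.8 * rng.lognormal(0.0, 0.3, size=40)    # ~20% too optimistic, plus noise

print("mean forecast/actual ratio:", round((forecast / actual).mean(), 2))   # calibration (the shift)
print("Pearson r (log scale):    ", round(pearsonr(np.log(forecast), np.log(actual))[0], 2))
print("Spearman rho (ranking):   ", round(spearmanr(forecast, actual)[0], 2))
```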

Moving on from this, you could say that problem-solving goes into productivity, but it’s not productivity (I frequently argue in my talks and papers that it is useful for organizations and leaders to focus more on the former than the latter). Still, connecting increases in “time savings” to “productivity” is an extremely contentious exercise, and has been since about the time we started talking about factories. Given reactions in software circles to “productivity” generalizations, I’m surprised this doesn’t get more explanation in the study. The main thing I will say here is that it’s pretty widely acknowledged that measuring a simple time change isn’t the same as measuring productivity. One obvious issue is that you can do things quickly and badly, in a way where the cost doesn't become apparent for a while. Actually, that is itself often a criticism of how developers might use AI! So perhaps AI slowing developers down is a positive finding!

We can argue about this, and the study doesn't answer it, because there is very little motivating literature review here that tells us exactly why we should expect AI to speed developers up or slow them down in terms of the human problem-solving involved, although there is a lot about previous mixed findings on whether AI does this. I don’t expect software-focused teams to focus much on cognitive science or learning science, but I do find it a bit odd to report an estimation-inaccuracy effect and not cite any literature about things like the planning fallacy, or even much about estimation of software tasks, itself a fairly common topic of software research. 

Nevertheless, I suppose the naturalistic setting of the tasks means that “finishing an issue” has some built-in developer judgment in it that the issue was finished well. To their credit, the study thinks about this as well, encouraging developers to sincerely finish their issues, and they know they're being recorded. So I'm not really concerned that people are doing something other than sincere working in this study.

If we are interested in learning about what corrects the perceptions of developers about the time it takes to complete an issue, I would like to know more about how developers triage and assess issues before they start to do them. That is not necessarily “productivity,” but metacognition. 

Time estimation as a human cognitive task. This might seem like a small quibble, but the post-task time estimate is not the same operationalization as the per-issue pre-task estimate, and as an experimentalist that really grinds my gears. We think very carefully and hard about using the same operationalizations for things, and about asking participants to make exactly the same judgment or do the same responding in a repeated-measures design. Their design has developers estimate the time for each issue with and without AI, but then at the end estimate, on average, how much time AI “saved them.” Asking people to summarize an average over their entire experience feels murkier than asking them to immediately rate the “time savings” of the AI after each task. The latter would also avoid many of the memory contamination effects you might worry about when asking people to summarize their hindsight across many experiences, where presumably you could get things like recency bias, with people’s overall estimate of “AI time savings” being impacted more by the last task they just did. I’m not sure why you would do it the way it’s done in this study. 
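
Here’s a small sketch of how the two operationalizations can come apart. The recency weighting is an assumption I’m adding purely for illustration, and all the numbers are invented.

```python
# Sketch: an end-of-study "how much did AI save you overall?" estimate vs. the mean of
# per-task judgments, under an assumed recency-weighted memory. All numbers are invented.
import numpy as np

rng = np.random.default_rng(2)
per_task_savings = rng.normal(loc=-0.05, scale=0.30, size=12)   # per-task "time saved" (fractions)

immediate_mean = per_task_savings.mean()                        # averaging immediate per-task ratings

recency_weights = np.exp(np.linspace(0.0, 2.0, per_task_savings.size))  # later tasks weigh more (assumption)
recency_weights /= recency_weights.sum()
retrospective_summary = float(np.dot(recency_weights, per_task_savings))

print(f"mean of per-task ratings:         {immediate_mean:+.1%}")
print(f"recency-weighted hindsight guess: {retrospective_summary:+.1%}")
```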

I have another social-psych-feeling question about this study design, which is whether asking people to estimate the time for a task twice, with the only differentiator being “now imagine doing it with AI,” acts as a leading question. People like patterns. If you ask someone to make two estimates and change one thing about one of them, you’re pretty much encouraging them to change their guess. Of course that doesn’t explain the optimism bias, which I think could certainly be real, but I’m just pointing out that it’s not a given that we’re eliciting “deep beliefs about AI” here. 

I wonder if social desirability plays a role. People are always anxious not to be wrong about their guesses in studies. The instructions to participants even emphasize: “we expect you might personally find it useful to know if AI actually speeds you up!” and the study is named “METR Human Uplift Pilot,” all of which any reasonable participant might read and think, “this study is about AI speeding developers up, so it probably sped me up.” I have learned working as a psychologist with software teams that people will believe a frightening number of things you say just because you have a PhD, and it’s worth being careful about accidentally conveying researcher motivations in the messages participants are hearing. In fact, the advertisement also says “the study aims to measure speedup”! 

They also mention that, to incentivize accuracy within the experiment, they pay for more accurate point estimates, which is an interesting choice and also possibly a confound here if you're getting the message that the study is about "speedup." The evidence for whether monetary incentives increase accuracy is mixed. It's hard to say what this does in this case, but I think we can ask whether this post-work estimation is a comprehensive measure of developers' persistent, robust beliefs about AI.

Order effects. 

Because developers can choose how they work on issues, and can even work on several at once, this study may inadvertently have order effects. Consider this: when you sit down to work on issues, do you work in a completely random order? If you’re like most people, you probably don't. You may have a sense of pacing yourself. Maybe you like to cluster all your easy issues first; maybe you want to work up to the big ones. The fact that developers get to choose this freely means that the study cannot control for the possible order effects that developer choice introduces. Order effects can certainly impact how we work.

Possible order effects can troublingly introduce something that we call “spillover effects” in RCT-land. This happens classically when information from one condition is shared with the other condition, so that treatment can impact the supposed “non-treatment” observation. We could also call this contamination, or interference; I like to use the term contamination for the visceral accuracy. Suppose that working with AI has an effect that lingers after the task is done, like a developer reading their own code differently or prioritizing a different task within the issue. Suppose that one condition is more tiring than the other, leading the task immediately following to be penalized. In the text they say "nearly all quantiles of observed implementation time see AI-allowed issues taking longer," but Figure 5 sure visually looks like there's some kind of relationship between how long an issue takes and whether or not we see a divergence between the AI condition and the not-AI condition. That could be wrapped up in an order effect: as will get tiring by the end of this piece, I'm going to suggest that task context is changing what happens in the AI condition.
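
To see what that kind of contamination could do, consider a deliberately simple simulation. The carryover parameter and every other number in it is an assumption for illustration, not an estimate from the study.

```python
# Sketch: task-level randomization with a lingering carryover after AI tasks. The
# within-study contrast can still look clean, but the "no-AI" observations no longer
# represent an AI-free workflow. All parameters below are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(3)
n = 5000
ai = rng.integers(0, 2, size=n)               # per-task random assignment
baseline = rng.normal(1.00, 0.30, size=n)     # hours a task would take with no AI anywhere
carryover = 0.10                              # assumed lingering cost on whatever comes next

time = baseline + 0.15 * ai                   # direct effect on the AI task itself
time[1:] += carryover * ai[:-1]               # spillover onto the following task

print("mean of 'no-AI' tasks under spillover:", round(time[ai == 0].mean(), 3))
print("mean in a truly AI-free workflow:     ", round(baseline.mean(), 3))
print("naive AI-minus-control contrast:      ", round(time[ai == 1].mean() - time[ai == 0].mean(), 3))
```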

As the order-effects discussion suggests, there is also a tremendous amount of possible contamination here from the participants’ choices about both how to use the AI and how to approach their real-world problem-solving. That, to me, puts this much more in the realm of a “field study” than an RCT. Well-informed folks have reasonable disagreements about this; Stephen Wild, for instance, points out that a within-person RCT is a reasonable descriptor. I don’t land on the side of feeling comfortable calling this an RCT in the context of what we are specifically trying to test here (and because it annoys me when studies use medical research language to confer authority on their claims over and above other methods, which is strongly suggested in the intro). I personally think the term RCT implies you have a very strong argument for the “controlled trial” part of the design, in the context of the contamination that’s reasonable to expect. Ben Recht’s writeup lands on my side: “First, I don’t like calling this study an ‘RCT.’ There is no control group! There are 16 people and they receive both treatments. We’re supposed to believe that the ‘treated units’ here are the coding assignments.”

Really, the critical issue I and others keep coming back to is that in this study design, we can never observe the same task solved by AI and not being solved with AI. There is a lot of information contained within these developers’ heads about these issues, presumably a project history, and their own metacognitive strategies about how they’re problem-solving. This isn’t purely a look at the treatment effect of AI usage.

It’s worth noting that samples of developers' work are also nested by repository. Repositories are not equally represented or sampled in this study either; while each repo has AI and not-AI conditions, they’re not each putting the same number of observations into the collective time pots. Some repositories have many tasks, some as few as one in each condition. Repositories and their nature and structure were something of a causal ghost for me in this study. Given that the existing repo might very steeply change how useful an AI can be, that feels like another important qualifier to attributing these time effects solely to the AI, rather than treating them as estimates that should be disaggregated down to developer * AI * issue * repository interactions. 
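
Here’s a toy sketch of that disaggregation worry, with made-up repositories, task counts, and effects.

```python
# Sketch: a pooled "AI effect" dominated by one heavily sampled repository.
# Repo names, task counts, and per-repo effects are all hypothetical.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
repos = {"repo_A": (30, +0.30), "repo_B": (4, -0.10), "repo_C": (2, -0.05)}  # (tasks, AI effect)

rows = []
for repo, (n_tasks, effect) in repos.items():
    for i in range(n_tasks):
        ai = i % 2                              # alternate so each repo sees both conditions
        rows.append({"repo": repo, "ai": ai,
                     "time": rng.normal(1.0, 0.2) + effect * ai})
df = pd.DataFrame(rows)

pooled = df[df.ai == 1]["time"].mean() - df[df.ai == 0]["time"].mean()
per_repo = df.groupby(["repo", "ai"])["time"].mean().unstack()
print(f"pooled AI effect: {pooled:+.2f}")       # looks like one clean slowdown number...
print((per_repo[1] - per_repo[0]).round(2))     # ...but it is mostly repo_A talking
```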

Developer differences & who these developers are 

I thought it was striking that developers in this study had relatively low experience with Cursor. The study presents this in a weirdly generalized way, as if it were a census fact (but I assume it’s about their participants): “Developers have a range of experience using AI tools: 93% have prior experience with tools like ChatGPT, but only 44% have experience using Cursor.” They provide some minimal Cursor usage check, but they don’t enforce which “AI” developers use. Right away, that feels like a massive muddle to the estimates. If some developers are chatting with ChatGPT and others are zooming around with Cursor in a very different way, are we really ensuring that we’re gathering the same kind of “usage”?

I think a different interesting study would have been testing the effect of training people in a structured and comparable way to use a single AI tool, with an informed set of learning goals, and testing pre- and post-measures of how they tackle the same issues, with order effects controlled as well. Particularly because we know that effects like pretesting can change how well people solve problems with the assistance of technology: your metacognition about how you are solving a problem can have a huge impact on your problem-solving.  

The study does not report demographic characteristics or speak to the diversity of its sample beyond developer experience. This isn’t uncommon for software work, but I don’t like it. I do not mind case studies and small samples; I think they are often very necessary. But if you’re studying people and making claims about how all developers work, I think you should include the large social demographic categories that robustly and consistently impact the experiences of people in tech. This potentially also matters in the context of the perceptions developers have about AI. In my observational survey study on AI Skill Threat in 2023, we also saw some big differences in trust in the quality of AI output by demographic group, differences which have continually come up when people start to include those variables.

Who we’re learning from also matters because the claim of the study is that it’s interested in the effect of AI on productivity in general. The study notes its limitations throughout the end sections, but what we think a representative sample of "developers" looks like (both the current state of the industry, but also with an eye toward important goals like including underrepresented perspectives, because the industry of the future looks different) is an important thing to get clearer about. Jordan Nafa points out that ideally you would want a rich pre-survey that clearly gauges the experience participants have using AI coding tools, and I would say even more specifically that we could consider designs which identify specific skills and attempt a skills inventory or audit. Every study cannot contain everything, but these would be ways to bring different sorts of important information into our model and could possibly provide important validity checks.  

What type of effect should we even be expecting?

Continuing with our research design hats on, I want you to ask a few more questions of this research design. One big question that occurs to me is whether group averages are truly informative when it comes to time savings on development tasks. Would we expect to see a single average lift across all people, or a more complex effect where some developers gain and some lose? Would we expect that lift to peter out, to have a ceiling? To have certain preconditions necessary to unlock it? All of this can help us think about what study we would design to answer this question.
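
As a tiny illustration of how a single group average can paper over a split population, here is a sketch with invented group sizes and effects.

```python
# Sketch: some developers gain, more lose, and the group average hides the split.
# Group sizes and effect sizes are made up for illustration.
import numpy as np

rng = np.random.default_rng(5)
gainers = rng.normal(-0.20, 0.10, size=6)    # six developers sped up ~20%
losers = rng.normal(+0.40, 0.10, size=10)    # ten developers slowed down ~40%
effects = np.concatenate([gainers, losers])

print(f"group-average effect:      {effects.mean():+.2f}")   # one headline number
print(f"share who actually gained: {(effects < 0).mean():.0%}")
```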

The idea that whether or not we get “value” from AI changes a lot depending on what we are working on and who we are when we show up to the tools is something much of my tech community pointed out when I posted about this study on Bluesky. Here’s a reply by David Andersen:

This is very much my experience also. I've had Claude write a bunch of html+JavaScript UI stuff for me that's totally out of my normal practice and it's been fantastic. It is anti-useful for more than autocomplete in research code in my domain (though great for quick graphing scripts).

It’s worth noting again that developers’ previous experience with Cursor wasn’t well controlled in this study. We’re not matching slowdown estimates to any kind of learning background with these tools. 

Reactions to this study: get good at recognizing the bullshit 

Thus far, we’ve been taking a scientific walk through this study in the hopes that it’s an interesting aid to thinking about research methods. But let’s be real: few people in this industry are taking findings directly from the appendices of a study, unless you are in the know and signed up for this newsletter. 

To the extent that the news talks about developers at all, it usually wants a dramatic story. Let’s look at how three different pieces of news media describe these findings, ranging from misleading to flat wrong to good. I googled “AI slows down developers” and pulled a few takes.

Reuters: “Contrary to popular belief, using cutting-edge artificial intelligence tools slowed down experienced software developers when they were working in codebases familiar to them, rather than supercharging their work, a new study found.” and "The slowdown stemmed from developers needing to spend time going over and correcting what the AI models suggested."

While not factually inaccurate, it’s pretty misleading if you don't read the study. I don’t know why journalists always feel the need to call every piece of science they report on “in-depth,” but it implies that vast numbers of developers were studied. While it’s accurate to say these folks were working on codebases familiar to them, it completely skips over the suggestive evidence in the study that people find AI less helpful in their areas of expertise and familiarity and more helpful with more novel problems. It very confidently asserts that we KNOW why the slowdown happens (the study itself caveats this). It also ignores what I’m going to talk about next: the times this slowdown apparently vanishes.

Techradar has an article with the subheader: “Experienced developers aren't getting any benefit from AI, report claims,” which is more squarely in the realm of absolutely wrong about what the study even says. There are literal quotes from developers talking about benefits!

Techcrunch, on the other hand, has an article that felt more balanced to me. They give high-level study details, and even emphasize information about the participants  that contextualizes the evidence and how we should generalize about it for our own lives – e.g., “Notably, only 56% of the developers in the study had experience using Cursor, the main AI tool offered in the study.” They use language like “[these] findings raise questions about the supposed universal productivity gains,” which I think is a very fair description, and keep their prescriptive take to the sensible “developers shouldn’t assume that AI coding tools [...] will immediately speed up their workflows.” 

I am entirely of the mindset that leaders are the ones who need to be told this, but fair enough.  

I’ve been reading Calling Bullshit by Carl Bergstrom and Jevin West, and I recommend their chapter on causality for its relentless but fun and useful breakdown of how evidence gets mistranslated across headlines and social media, and even by experts and scientists.  

Overall Impressions 

I think the most admirable and impressive part of this study is getting the least amount of credit in the press: their recruiting and the real-world setting. It’s difficult to motivate people to participate in research that takes a lot of effort and time, and choosing to recruit among a specialized population like open-source maintainers, working in the context of their own projects, sets the bar even higher. This is an interesting population and setting to study. I don't disagree with the push in the intro that many "AI benchmark"-style studies are artificially constrained. I just think we have a lot of behavioral science questions to answer before we can be confident about this kind of estimate.

I suspect that the high bar for this kind of recruiting (and the sheer amount of time it takes to monitor people working through many tasks) is part of the reason for the small sample size (in terms of individual people – the task is somewhat the unit of focus in the study, and the tasks are a good collection). This is also a reason that I feel more kindly about the small sample size than many of the people who texted me this study. We often internalize the idea that small sample = bad. This really isn’t true, and it’s dangerous to apply as a generalization. We’ve observed dazzlingly important mechanisms in biology based on small samples. It all depends on your experimental question. 

However, you aren’t wrong if you had the instinct to ask how this limits us, because it limits us in precisely the way that the press and social media posts about this study aren’t taking into account. We want a large number of people if we’re going to make a generalization about developers as a group, not just within-developer effects. For instance, we want to consider whether we’ve done a good enough job of a) powering our study to show a population-level effect and b) including enough people, in some principled way, that we feel relatively certain our treatment effect isn’t going to be contaminated by unmeasured confounds about the people. Randomization of the treatment between people, in many cases, is better suited to handle those confounding questions, although no study design is perfect. 
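
Here’s a rough simulation of the “powering for a population-level effect” worry. Every quantity in it (the between-person spread, the measurement noise, the effect itself) is an assumption, not something estimated from the study.

```python
# Sketch: how between-person variability limits what 16 people can say about
# "developers in general". All effect sizes and SDs below are assumptions.
import numpy as np

rng = np.random.default_rng(6)

def one_study(n_devs=16, mean_effect=0.19, between_sd=0.25, within_se=0.10):
    true_effects = rng.normal(mean_effect, between_sd, size=n_devs)   # each dev's own true effect
    observed = true_effects + rng.normal(0, within_se, size=n_devs)   # measured with noise
    t = observed.mean() / (observed.std(ddof=1) / np.sqrt(n_devs))    # one-sample t statistic
    return t > 1.753                                                  # one-sided 5% cutoff, df = 15

power = np.mean([one_study() for _ in range(5000)])
print(f"simulated chance of detecting a population-level effect: {power:.2f}")
```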

But beyond that, the blowup about “slowdown from AI” isn’t warranted by the strength of this evidence. The biggest problem I keep coming back to when trying to decide whether to trust this “slowdown” estimate is the fact that “tasks” are so wildly variable in software work, and that the time we spend solving them is wildly variable. This can make simple averages – including group averages – very misleading. We have argued as much in our paper on Cycle Time. In that study, cycle time is roughly analogous to gathering up a bunch of “issues” and looking at the length from start to finish. We observed a large number of developers (11k+) and observations (55k+), and our paper describes many, many issues with issues – for instance, we acknowledge that we don’t have a huge amount of context about team practices in using tickets, or the context that would let us better divide up our “cycle times” into types of work. Nevertheless, we describe some important things that run counter to the industry’s stereotypes: for instance, that even within-developer, a software developer’s own past average “time on task” isn’t a very good predictor of their future times. Software work is highly variable, and that variability does not always reflect an individual difference in the person or the way they’re working. 

I am not convinced that saving time is the only thing on developers’ minds. Just like teachers, they may be asking more about what they do with their time. 

The fact that this slowdown difference vanishes once we layer in sorting tasks by whether they include ‘scope creep’ speaks to its fragility. Take a look at Figure 7 in the appendix: “low prior task exposure” overlaps with zero, as does “high external resource needs.” This is potentially one of the most interesting elements of the study, tucked away in the back and only collected for “the latter half of issues completed in the study.” It seems to me like a good idea that came late in the game, an attempt to begin to slice away at that task heterogeneity. Setting aside the order effects and randomization design, this could have been a better plot choice to lead with, and I personally would’ve turned those “overlapping with zero” categories a neutral color, rather than keeping them red. Point estimates are almost certainly not some kind of ground truth with these small groups. I suspect that getting more context about tasks would further trouble this “slowdown effect.” 

Let’s keep in mind that the headline, intro, and title framing of this paper is that it’s finding out what AI does to developers’ work. This is a causal claim. Is it correct to say we can claim that AI and AI alone is causing the slowdown, if we have evidence that type of task is part of the picture? 

We could fall down other rabbit holes for things that trouble that task-group-average effect that is at the top of the paper, as in Figure 9, or Figure 17.

Unfortunately, as they note in 3.3., “we are not powered for statistically significant multiple comparisons when subsetting our data. This analysis is intended to provide speculative, suggestive evidence about the mechanisms behind slowdown.” Well, “speculative suggestive evidence” isn’t exactly what’s implied by naming a study the impact of AI on productivity and claiming a single element of randomization makes something an RCT. Despite the randomization, for practical and accurate purposes given what people imagine when they hear “RCT,” I land on the side of calling this a field study. As with most questions about empirical evidence, confusion and contrasting experiences with AI likely mean that we haven’t specified our questions enough. 

Some clues to this are hidden in the most interesting parts of the study – the developers’ qualitative comments. There, the rich heterogeneity and strategic problem-solving developers go through as they think about interacting with both problems and tools is more obvious. This developer * tool * task interaction (which can be exploded further into developer * community of practice * tool * task * the future tasks they’re imagining this task relying on… software is hard work!) is very difficult to study. This is one reason that when we find associations in the world, we need to hold a balance between letting them inform our hypotheses and interrogating other causes that could exist and explain the same pattern.

This certainly sounds like people are finding AI useful as a learning tool when our questions connect to some kinds of knowledge gaps, and also when the repo and issue solution space provide a structure in which the AI can be an aid. And what do we know about learning and exploration around knowledge gaps…? That it takes systematically more time than a rote task. I wonder: if we looked within the “AI-assigned” tasks and sorted them by how much the developer was challenging themselves to learn something new, would that prove to be associated with the slowdown? 

Other qualitative comments continue to speak to the task context as it interacts with the AI, further showing us that this is not just measuring a “clean” effect of AI. Developers also presumably bring this thinking into their choices about how much they use AI for a given issue in the “you can use AI as much as you want” condition. As Ben Recht noted in his piece on this study, SUTVA does not necessarily hold (SUTVA stands for Stable Unit Treatment Value Assumption, and it is a critical piece of the research design we rely on when we are thinking about RCTs for things like the effect of a drug on disease progression or outcome).

I think that entire last paragraph in the section screenshot (it's in C.1.2.) probably should’ve been in the intro to this paper. Actually, I think studying why experienced developers choose to use AI for a certain task, and when they don’t, and whether they think they get that wrong sometimes would probably be more interesting to me than time differences. That is an evaluation perspective that I wish this “evaluation of AI” world would consider, a behavioral science perspective on how developers plan and triage. It is not surprising that people are fairly bad at estimating the time future tasks will take, which is called the planning fallacy, and it is also not surprising that people relatively unused to a new tool will have an optimistic impression about it. But I don’t know anyone who actually works with human behavioral data who would ever try to use developers’ self-reports about something like time as if it were an objective systems measure. Maybe the idea that self-report and human perception has strengths and weaknesses is a big surprise for some people out there, but not for those of us who study bias in people’s perceptions of the world.  

Concluding thoughts 

Calling something an RCT doesn’t mean that the measures in the study are being made at the level of precision or accuracy that we might need to inform certain questions. In many takes, this study is being framed as causal evidence about the impact of AI on productivity. I deeply disagree that in the context of this study, the measures of time (or “slowdown”), as the authors write, “directly measure the impact of AI tools on developer productivity.” Direct is a very big claim for something that also contains treatment heterogeneity, task heterogeneity, developers’ individual choices, and other things we could think of here as spillover effects. 

But even if we’re fully bought into the "slowdown" as presented in this study, I don’t find it particularly alarming. Working with AI is different from working without it and it’s only the people who have a massive conviction that all work with AI will be immediately cheetah-speed who might feel amazed by this. I didn’t have that preconception, and I think workflow change is hard, but I also don’t feel like the “19%” measure in this study is going to turn out to be terribly reliable.

At the end of the day I believe the most important and interesting questions about developers tools are going to be about people's problem-solving and how that interacts with context. I don’t believe multivariate effects in complex human learning and human problem-solving result in simple additive relationships that show up equally for all people as long as we pick the right magic tool.

Simple group difference averages about complex work are easy bait for the popular press but almost never what we need to answer our burning questions about how we should work. Honestly, I understand why the intro reads like it does even though I disagree with it. Sometimes researchers don’t get rewarded for the most valuable parts of what they’re doing – figuring out difficult recruitment, and increasing ecological validity by focusing on real tasks. More of the Figure 7s and less of the Figure 1s can help us improve our evaluation of evidence. 

We are not going to be out of the era of extremely dramatic causal claims about software developers anytime soon, so everyone who works on or around a software team should learn how to evaluate claims about developer productivity. It’s not just that understanding software developer productivity attaches to a lot of money – it’s also that controlling the narrative about software developer productivity attaches to a lot of money. Journalism already hardly knows how to talk about technology without falling into breathless stereotypes, and headlines rise and fall by how extreme their claims are. But you don’t have to get swept up in the storm. Knowing just a bit of research design is a secret weapon you can pull out to decide what you think about claims. 


Musk’s giant Tesla factory casts shadow on lives in a quiet corner of Germany | The Guardian


When Elon Musk advised Germans to vote for the far-right Alternative für Deutschland (AfD) in elections last year, Manu Hoyer – who lives in the small town where the billionaire had built Tesla’s European production hub – wrote to the state premier to complain.

“How can you do business with someone who supports rightwing extremism?” she asked Dietmar Woidke, the Social Democrat leader of the eastern state of Brandenburg, who had backed the setting up of the Tesla Giga factory in Grünheide.

Hoyer said that in Woidke’s “disappointing, but predictable” answer, he denied the charge. “He said he didn’t know him personally. As if that excused him.”

She had co-founded a Citizens’ Initiative to oppose Musk’s plans, announced in 2019, to build in the sparsely populated municipality in the sandy plains south-east of Berlin. The initiative’s fears at the time were largely over the potential environmental impact of the plant on the region’s pine forests and groundwater.

However, more recently it is Musk’s politics that have caused particular alarm. Not only has he offered his high-profile support to far-right European parties, but at a rally after Donald Trump’s inauguration he appeared to twice make the Nazi salute.

In the meantime Tesla sales have slumped, especially in Europe – where new vehicle sales fell for five consecutive months despite an overall growth in the electric car market.

Heiko Baschin, another member of the citizens’ initiative, said he had been watching with a certain amount of schadenfreude. “We put our hopes in this,” the carpenter said, discussing the change in the company’s fortunes on a recent forest walk in the shadows of the sprawling Grünheide factory.

As sales have declined, the factory has suffered. Shifts manufacturing the Model Y have been reduced from three to two a day. The trade union IG Metall – which recruited several hundred workers despite opposition from Tesla – has urged the company to consider putting workers on “Kurzarbeit”, the short-time work allowance much of the embattled car industry has introduced to enable it to retain workers during a downturn.

The regional press has reported how unsold Teslas have been moved on transporters en masse to a former East German airport 60km (37 miles) away, where, hidden behind trees and parked alongside solar panels, they bake in the sun.

Musk’s apparent Nazi salute was in general met with shock and horror in Germany but did not play large in Grünheide, until campaign groups projected an image of it on to the facade of the Tesla factory, provocatively placing the Nazi-associated word “heil” in front of the Tesla logo.

The shock caused by the incident was palpable on the factory floor, workers told the tabloid Berliner Kurier. “At Tesla Germany they had pretended they had nothing to do with (Musk) and were keeping quiet,” it wrote. Now they could no longer ignore their association.

Workers are hard to reach, most having been forced to sign non-disclosure agreements (NDAs). But on Kununu, a job portal where employees can anonymously vent their feelings about their workplace, one Tesla worker has written: “The brand once stood for cosmopolitanism, progress, and tolerance, but now it stands for the exact opposite. That bothers almost everyone here, and you can feel it”.

Almut, a resident of Grünheide, said local politicians were keen to cite the benefits Tesla had brought to the region, but “neglect to mention at the same time the problematic reality that we are subsidising the richest man in the world, who in no way takes any social responsibility for what happens here”.

She said local people joke among themselves about what might take the place of the factory, should Tesla fail. “A munitions factory? A prison? In some ways these would seem like favourable alternatives,” she said. The only positive contribution, as she saw it, that Tesla had made to Grünheide was a robotic lawn mower it had donated to the local football club.

Two weeks before the salute, Musk had followed his endorsement of the AfD in the German federal elections with an hour-long conversation with the anti-immigrant party’s co-leader, Alice Weidel. The two discussed topics including Hitler, solar power and German bureaucracy, which Musk said had required Tesla to submit forms running to 25,000 pages in order to build the Grünheide factory. Unmentioned was the fact that the AfD had vehemently opposed the Tesla factory, citing its fears over US-driven turbo capitalism and a watering down of workers’ rights. “People really need to get behind the AfD,” Musk said.

For Grünheide’s residents who oppose Musk, their preoccupation remains the impact of the factory on their rural community, which is characterised by its woodlands, lakes and rivers.

Existing cycle paths have been diverted, and new roads have required the felling of large swathes of pine forest, threatening the already perilous supplies of drinking water in a region declared a drought zone, the driest anywhere in Germany.

The 300-hectare (740-acre) factory complex itself is due to be expanded in the near future by a further 100 hectares, under plans signed off by Grünheide’s mayor despite a local referendum in which 62% expressed their opposition.

Supporters point to the 11,000 jobs the factory has created, and the boost it has given to the local economy in a region of the former communist east that was one of the lowest-performing in the country. Some young people enthuse that the trains to Berlin now run more regularly, the supermarkets are better stocked, and their home town is now on the map as a beacon of “green capitalism” alongside Shanghai, Nevada and Austin, the locations of the other Tesla factories. They hanker for an invitation to the “rave cave” techno dance space Musk has allegedly constructed within the factory complex.

The recruitment page of the factory’s website – which emphasises that diversity is at the core of its business model – shows a lengthy list of positions needing to be filled, from shift managers to maintenance technicians.

Nevertheless, the mood has cooled even among those who used to enthusiastically speak out in favour of Tesla, such as a group of local teenage schoolboys who habitually flew drones over the site when the factory was under construction and proudly posted the footage from them on YouTube – until Musk asked them to stop. “Nobody is willing to speak publicly about Tesla/Elon any more … even anonymously,” one told the Guardian via text message, without elaborating.

There was no response to a request for an interview with the company or for access to the factory.

Arne Christiani, the mayor of Grünheide and an unwavering Musk enthusiast, said he was confident Tesla would stay in Grünheide and would thrive. He was unmoved, he said, by what Musk said or did. “You have to distinguish between what happens in the US and here in Grünheide,” he said.

Hoyer, who lives 9km from the factory, said she had not relinquished her dream of one day being able to see a starry sky from her garden again. “Since the factory was built the light pollution from the round-the-clock operation has put paid to that,” she said, showing before and after pictures on her mobile phone.


Inside Gaza’s ‘death traps’

sarcozona comments: Even the financial times knows Israel is genocidal

GLP-1 drugs for addiction: Confidence grows in new treatment option | STAT


WERNERSVILLE, Pa. — To make sense of the reds and greens dancing across a computer monitor displaying a scale image of a human brain, one requires a vivid vocabulary. At this upscale addiction treatment facility, “neurofeedback therapy” and “quantitative electroencephalogram” are part of the holistic, no-expenses-spared treatment philosophy on offer. 

But customized brain scans aren’t the technology that has both staff and patients here most excited. Lately, the bigger paradigm shift has come in the form of semaglutide — the blockbuster medication commonly used for weight loss and branded as Ozempic or Wegovy. 

In recent months, doctors at Caron Treatment Centers, an elite nonprofit rehab facility, have begun prescribing semaglutide to patients not to address obesity or diabetes but to help treat the addictions that brought them here in the first place. 

“I don’t think of this as doing anything wild west,” said Steven Klein, one of the staff physicians who has pioneered the practice of prescribing GLP-1s, as the class of medications is known, as a treatment for addiction. “We’re using something off-label under the umbrella of addiction, whether that be food, sex, alcohol, or opioids.” 

Despite Klein’s attempts to downplay the program, Caron is, without a doubt, in uncharted territory. While the medications show significant promise as addiction treatments, only a handful of clinical trials are underway to measure their ability to reduce substance use. Several are unlikely to publish results within the next two years. 

At this idyllic facility 70 miles outside Philadelphia, however, Klein and two fellow doctors are bypassing the speculation and the slow-moving scientific enterprise. No program has so openly and aggressively touted GLP-1s as a means of treating substance use disorder. And while their operation is backed by limited clinical data, their own eyes are giving them more confidence day by day. 

Remarkably, all three of the physicians are in long-term recovery from addiction: Mo Sarhan, who recently decamped from Pennsylvania to run Caron’s sister facility in Florida; Adam Scioli, the organization’s chief medical officer; and Klein, who evangelizes GLP-1s both because they’ve helped his patients’ recovery and because, in 2023, he used them to drop 40 pounds. 

With a combined 250,000 Americans dying each year from drug overdose and alcohol-related causes, the field of addiction treatment is ripe for a paradigm shift. What few medications do exist for substance use disorders are either marginally effective or sorely underutilized. For opioid addiction, buprenorphine — which Caron also offers — and another medication, methadone, face immense stigma. For alcohol, medications like naltrexone or acamprosate have only marginal benefits. For some substances, like methamphetamine or cocaine, there’s no medication treatment at all. 

Until now — that is, if Caron’s doctors are right. 

The Caron doctors, outwardly, try to temper their optimism, but it’s clear that each views GLP-1s as a potential game-changer. Sarhan, who had noticed in his own Alcoholics Anonymous group that people using the medications for weight loss fared better in their recovery, said in a recent interview that semaglutide has “obliterated” many of his patients’ cravings for the substances they previously used, including opioids, alcohol, and stimulants.

And even outside the context of addiction, it seems the medications could redefine human beings’ relationship with many forms of pleasure. 

In interviews, experts reported to STAT a remarkable array of potential uses or, in some cases, anecdotes of GLP-1s appearing to transform people’s addictive relationships with tobacco, nail-biting, drinking, gambling, drugs, sex, shopping, and more.

Even Klein, who never exhibited problem gambling behavior, has seen his habits shift: Years ago, while driving to family vacations on the Jersey Shore, he’d stop for an hour of blackjack at the Borgata in Atlantic City. Now, he said, he’d rather just get to the beach.

The shift fits perfectly into his broader philosophy on GLP-1s: That brains battling addiction often generate urges to take part in a harmful behavior, be it big or small. The medications, in his own experience, clearly play a role in quieting those voices. 

“I had this record in my brain that meant when I’m stressed, I overeat,” Klein said. “The GLP-1s just lifted the needle off that record. I know what drug addiction feels like. I know that those voices are the same. I know they’re my voice, convincing me to do things I really don’t want to do.” 

Addiction and trauma: one doctor’s story

Like so many stories of addiction and redemption, Klein’s tale begins with an acute trauma. 

As a high-school sophomore on the track and field team, Klein was practicing shot put when an elderly assistant coach wandered into the path of his throw. The metal sphere struck the coach in the head, and he died a week later. 

Klein’s substance use began almost immediately: a means of staving off memories of the nightmare scene that had played out in front of him. 

“I still can recall aspects of the facial expression that he had, and being picked up from school by my parents and going to the hospital with my dad,” he said. “I just really started to fear sleep. So I started using stimulants to not go to sleep, or drinking alcohol until I would pass out into a blacked-out stupor. It was less that the amount of those substances mattered, but more that I developed that neurocognitive link where if I have a feeling and I don’t want to deal with it, substances are the answer.”

But throughout his childhood and into his medical training, Klein’s substance use largely remained under control. In the early going, it was no match for the more productive half of his brain: the half that led him to California in pursuit of an M.D.-Ph.D. with a focus in genetics. 

Klein, briefly, was living the good life: a full ride at UCLA, bylines in prestigious medical journals, contributions to significant advances in genetics research, and engaged and planning a Napa Valley wedding with a wealthy fiance. But when his partner confessed to an affair and the relationship collapsed, his addiction quickly escalated. 

“My life kind of fell apart,” he said. “All of this armor just disappeared, and I proceeded to drink the way, as it’s been explained to me, that I’d always wanted to, which was without any concern for what was going on. I found myself in the midst of a very bad drug and benzodiazepine and cocaine addiction that was probably going to ruin me.” 

Klein, at one point, even moved to an apartment in West Hollywood just steps from his favorite bar — a defense mechanism of sorts, aimed not only at convenience but also at preventing him from drinking and driving.

Luckily, Klein’s home was also steps away from the famed West Hollywood Recovery Center, an epicenter of addiction recovery in Southern California known as the “Log Cabin.”

In 2016, he began attending an Alcoholics Anonymous meeting there. He has been sober since. 

But even as his sobriety remained consistent, other aspects of his health flagged — particularly during and after the Covid-19 pandemic, when his weight jumped and his metabolic panels looked, in the words of Klein’s doctor, “kind of crap.” 

After several cycles of an enthusiastic week of dieting and exercise followed by several weeks with neither, the doctor suggested he might be a candidate for Mounjaro, a formulation of the GLP-1 tirzepatide, a close relative of semaglutide. 

The effect was transformative. Klein quickly began losing weight and experienced next to no side effects: He threw up only once, he said, on the day of one of his largest dose increases, but has since learned to navigate the gastrointestinal issues common in those who take GLP-1s. 

Today, Klein is happily married, planning to start a family, and has made peace with pivoting from a high-octane research career to a calmer existence in the Philadelphia exurbs. 

Now, he spends his long drives sponsoring other AA participants by phone, leads group therapy sessions, and offers lectures on neurobiology to Caron participants. He’s adopted a regimen of strength training that has made his Mounjaro-backed weight loss both healthy and sustainable. And he has established himself as a sounding board for other doctors hoping to use GLP-1s to help their patients fight addiction, lecturing in webinars and even orchestrating a Google group in which he tutors other doctors on semaglutide treatment, often sharing the consent forms he gives his patients and strategies for fighting obstinate insurance companies.

“I truly think these medications work,” he said, “because obesity is an addiction to food.”

As Klein reeled from his breakup and his substance use spiraled, a different doctor was also seeking recovery along the California coast: Mo Sarhan. 

As part of his treatment plan at a facility in Malibu, Sarhan was regularly driven east across the Pacific Coast Highway, through Santa Monica, all the way to the West Hollywood Recovery Center. 

At the time, the GLP-1 craze was just beginning. But in a city known for its focus on both looks and physical fitness, the medications’ use was already widespread, even in Sarhan’s Alcoholics Anonymous group. 

“I would hear people say they got put on medications for diabetes or weight loss, and then all of a sudden they’re feeling happier and healthier, and they’re not craving as much, and their periods of sobriety are getting longer and longer,” he said. 

Soon after Sarhan began attending the meeting, he met Klein’s sponsor, who, upon learning Sarhan also worked in medicine, suggested he meet Klein. The two quickly became friends, even attending group sessions known as caduceus meetings, tailored specifically to medical professionals in recovery. 

When Sarhan moved to Pennsylvania and eventually wound up as an addiction medicine fellow at Caron, he began prescribing the medications, too: not explicitly for addiction, but to patients who otherwise met criteria for GLP-1s based on their body mass index or a diagnosis of type 2 diabetes. The results were so compelling that he soon created what he called “The Ozempic Files”: a specific repository of data about his own patients and their successes on semaglutide. 

And as Klein continued to pursue his prestigious career as a physician-scientist, focused on treating pediatric patients navigating complex genetic conditions, it was Sarhan who observed that he seemed “miserable” and suggested he apply to the same addiction medicine fellowship at Caron. 

Sarhan, upon moving to work at Caron’s facility in Delray Beach, Fla., effectively handed the baton to Klein, who stepped into his role working with the “relapse unit” — patients making a new attempt at recovery after trying, unsuccessfully, before. 

It was in this unit that Sarhan first noticed GLP-1s’ potential, in the subset of patients eligible for the drugs on weight loss grounds. 

“I started to hear from them that things felt different,” Sarhan said. “They weren’t craving as much, and they were more engaged in their recovery programs. So I started doing it a little bit more frequently and with a little more intention. And then, when Steve joined, we joined forces.” 

Caron’s current protocol for using GLP-1s bears only a passing resemblance to the protocol for patients seeking treatment for obesity, diabetes, sleep apnea, or a related condition. 

While doses for the brand-name drugs often exceed 2 milligrams per week, Caron uses an initiation dose of 0.25 mg and then doubles it if patients remain comfortable. And instead of the high-priced injector pens like Mounjaro or Ozempic, Klein has largely opted for cheap, compounded versions of the medications, especially for patients who are not otherwise eligible for GLP-1s. 

But it remains an open question whether Caron’s early success in employing semaglutide is more broadly applicable across addiction medicine. 

Caron is an elite nonprofit dating back to 1957 that reported $85 million in revenue in 2024, according to a recent tax disclosure. The hourlong drive to its campus from Philadelphia separates some of the nation’s most drug-ravaged neighborhoods from rolling hills and impressive mansions along roads dotted with yellow signs depicting a horse and buggy, warning motorists to leave room for Amish carriages. 

Beyond using high-tech brain scans, Caron patients have access to dietitians and a full suite of medical services, with specific wards for older patients or those with severe physical impairments. As an inpatient rehab facility, its environment is tightly controlled, making weekly injections practical here in a way they likely wouldn’t be for someone using fentanyl or meth on the street. 

Many patients pay dearly for this privilege: While some have insurance coverage, others pay $30,000 for a standard one-month course of treatment or as much as $65,000 for a premium package and private room. 

The clinical support is also robust. Besides the wealth of services, patients have access to three doctors whose lived experience with substance use gives them unparalleled insights into their patients’ journey. Scioli, the chief medical officer, is a former board chair of International Doctors in Alcoholics Anonymous. 

Even for successful patients at Caron, however, it’s difficult to apportion credit to GLP-1s or to the many other bells and whistles their treatment program offers. 

“It’s really hard to extract what role these medications really played,” Sarhan said, describing one patient. “I put him on Mounjaro for type 2 diabetes. He’s lost a lot of weight, he was on the men’s relapse unit and was still sober a year after treatment, which is the second-longest period of sobriety he’s had. He’s been really engaged in 12-step programming, and he personally attributes his ongoing success to his engagement in AA.” 

The reality, Sarhan said, is likely more nuanced: AA is certainly playing a role, but so are Caron’s other treatment offerings, and so is Mounjaro. 

Caron maintains a strict confidentiality policy. Visitors sign documentation swearing to not reveal the identity of anyone they might come across on Caron’s campus, and the staff bars current and former participants from media interviews until they’ve achieved a full year in stable recovery. Given that Caron’s new GLP-1 program only began in April, no patients were available to describe their experience with the medications firsthand. 

Among the doctors, however, optimism abounds, and not just when it comes to substances. Amid a sharp rise in gambling and associated harms, Sarhan, in particular, is intrigued by GLP-1s’ potential use to treat other behavioral disorders. 

“Gambling disorder is the one that has the most of my attention,” Sarhan said. “The neurobiology and neurochemistry are remarkably similar.” 

It is unclear, however, whether Sarhan and Klein’s degree of optimism is justified: In particular, the two largely gloss over concerns about side effects, even though more than half of patients in one small study of GLP-1s for opioid use disorder withdrew from the three-week study because of gastrointestinal discomfort.

Unlike typical GLP-1 patients, not all who use the medications to treat addiction are overweight. But Klein and Sarhan don’t appear worried about malnutrition or muscle loss, in part because Caron’s 0.5-mg dosing protocol is so much lower than that for obesity. As of July, Klein said 229 Caron patients have received a GLP-1 since the start of 2024, the majority of whom started using the medication during treatment (though some arrived already taking them).  

That includes 47 patients who were given compounded GLP-1s specifically for addiction under a new program. Of that group, Klein said 70% have a BMI above 30, and 80% meet the criteria for receiving the medications on physical health grounds alone.   

Still, Caron’s specific initiative is unlikely to yield satisfying answers. The project is not being run as a formal clinical trial but as a “clinical initiative” that will tabulate patient data and record outcomes. 

“We don’t want people to get too far ahead of where the evidence currently is in terms of using these drugs,” said Stephanie Weiss, a staff clinician at the National Institute on Drug Abuse whose research focuses on GLP-1s and addiction but who is not involved in Caron’s new semaglutide initiative. “There’s no such thing as a silver bullet, and it’s probably better not to think of addiction as one single disease.”

But she allowed that logically, the mechanisms that make GLP-1s so effective at curbing hunger and “food noise” are deeply interrelated with the mental processes of addiction, and that the medications likely represent a paradigm shift in addiction medicine. 

“All these things interconnect in the central nervous system, and we don’t fully even understand all these pathways yet,” she said. “The level of impact we’re talking about does seem to be on a different plane.”

The future of Caron’s program is uncertain. It was launched with a grant from the Center for Addiction, Science, Policy, and Research, an advocacy group founded in 2024 that has spent its first year advocating for wider access to GLP-1s as an addiction treatment. But the program relies on Caron’s ability to source cheap, compounded semaglutide. To date, no GLP-1 medication has received a formal indication as an addiction treatment. No insurer has agreed to reimburse for expensive, brand-name GLP-1s for addiction, and many are even rolling back their coverage of the medications when prescribed for weight loss.  

Caron’s access to compounded versions may soon disappear, thanks to a Food and Drug Administration ruling that would bar compounding pharmacies from continuing to produce them. 

Klein is already devising workarounds. A select few compounding pharmacies, for instance, were licensed to produce semaglutide with a 12-month expiration date, extending supply well into 2026. Generic medications could become available in Canada next year. 

Beyond the results of Caron’s experiment, much remains to be seen about GLP-1s’ effectiveness as addiction treatments — particularly for the typical patient unable to afford a clinic like Caron — as well as drug companies’ willingness to market them for that purpose, and patients’ ability to access and tolerate them.

To Sarhan, two things already seem clear: Patients are interested in GLP-1s, and the medications are producing results unlike any he’s seen. 

“GLP-1 medications are all-around more attractive to patients: They’re new, they’re sexy, they cause weight loss, they jump-start wellness,” he said. 

“I’ve never had a person who I’ve started on naltrexone turn around and tell me that their cravings have been obliterated,” he added, referencing a common treatment for alcohol and opioid addiction. “Whereas I have had that happen with people who’ve been started on GLP-1 medications.”

STAT’s coverage of chronic health issues is supported by a grant from Bloomberg Philanthropies. Our financial supporters are not involved in any decisions about our journalism.


Some of the most annoying parts of a hospital visit can save lives | STAT


The first things I noticed when I woke up after my recent wrist operation at Mass General Hospital were the “YES” stenciled in purple ink on my right thumb, a marker that had given my surgeon the green light to operate on that hand, and an equally luminous “BLOCK” on the same arm, which was the anesthesiologist’s separate signpost for where to administer a nerve block.

The second was how many times — five or more, within 90 minutes — that a receptionist, nurses, and doctors asked me to confirm my name, date of birth, and why I was there. I was also asked whether I had allergies, metal or other implants, and other medical bugs, even though I’d answered all of that in an online questionnaire a week earlier and a pre-surgery check-in over the phone. And I had to sign a form making clear I understood the procedure in store.

Such obsessing would have annoyed most patients, but I was tickled. It reminded me how much things have changed, for the better, since I wrote a four-part series on hospital errors a quarter-century ago for the Boston Globe.

Back then, it was too common for surgeons to repair the wrong knee, remove an appendix instead of a gallbladder, or treat the healthy rather than the sick side of the brain. Or, worse still, to operate on the wrong patient. Which was understandable if not forgivable, since we all forget, and in 1999 few, if any, health care providers were meticulously scribbling reminders or compulsively asking questions.

Today’s simple, if irritating, measures have saved lives as well as embarrassment, the experts say, even if nobody is quite sure how many.

Part of the problem in tracking that progress is a lack of reporting. Hospitals are required to report mistakes like wrong-site surgeries, but the Betsey Lehman Center for Patient Safety in Boston estimates that as many as 85% of all harmful events go unchronicled. Errors can crop up even when doctors do everything right, but later find, for instance, that X-rays or biopsies were incorrectly read or dictated. Those and other deficits, specialists say, are bad at same-day surgery centers and worst at ones that do just one kind of procedure, be it removing cataracts or repairing joints.

But Robbie Goldstein, commissioner of public health in Massachusetts, cautioned that the problem is not the nature of such procedures. “It’s the separate danger of private-equity ownership,” which tends to be higher in such settings and is growing across the board. With corporations controlling things, he added, “it’s not about health care, it’s how many widgets you can get through the machine.”

Whatever the reason, the result is that too few hospitals foster the culture of safety that I experienced at Mass General. Such a culture means putting in place the multitiered checks and rechecks we associate more with airline cockpits and nuclear plants than surgical suites. Some of those I could see, but others — like the mandatory huddles of the surgical team before, during, and after the operation, where everyone confirms that the patient, procedure, and site are correct, and all are encouraged to raise concerns — I learned about only afterward from my doc and from Gerard Doherty, Mass General’s chair of surgery. It’s made clear to MGH staff that they won’t be berated or dismissed if they admit an error. And “in rare cases where we do identify missteps,” Doherty said, “we try to be just as rigorous about figuring out what went wrong so that the same thing doesn’t happen to another patient.”

A 2023 report in the Joint Commission Journal dug deeper into ongoing problems at hospitals nationwide with wrong-site surgery — the fifth most common harmful medical mistake — by analyzing data from legal claims filed with a medical malpractice company. The biggest problems were with orthopedic operations (35%), neurosurgery (22%), and urology (9%). The most common injuries that patients linked to the errors were the need for additional surgery (46%), pain (34%), mobility dysfunction (10%), worsened injury (9%), and death (7%). The 68 claims analyzed averaged $136,452.84, and 60% of cases were settled.

“The overwhelming top contributing factor to [wrong-site surgery] was failure to follow policy/protocol (83.8%) and failure to review the medical records (41.2%),” the authors wrote. “Safety measures need to be followed to prevent errors, and determining why they are not being used is key.”

That’s what the Joint Commission is trying to do, said Jonathan Perlin, head of that body, which accredits the safety at most U.S. health care institutions and which in 2004 adopted a formal standard for preventing wrong-site, wrong-procedure, and wrong-person surgery. “Like you,” Perlin told me, “when I had a procedure where the side and site are critical (kidney) some years back, I thanked the nurse for signing my belly. She was surprised by my thanks, and I explained that I know and [have] seen what happens when there’s a mix-up.”

All told, preventable medical errors are believed to kill 200,000 patients every year in America, making them the third leading cause of death.

Doctors are more likely today to operate on the correct parts of the correct patient, but there are other types of errors. Omissions in the care of patients “are responsible for the lion’s share of harm,” said Barbara Fain, director of the Lehman Center, which Massachusetts launched in honor of my former Globe colleague Betsey, who died in 1994 after doctors gave her four times the intended dose of a powerful chemotherapy drug. The most frequent and dangerous oversights and failures, Fain added, are infections, falls, pressure ulcers, and delays in diagnosis and treatment.

In my case, I knew from that long-ago Globe series and from 24 years running a Boston-based fellowship program for health journalists that there are two ways patients, on their own, can reduce the risk of post-operative infections: be the first case of the day, when surgeons are most likely to be on schedule and anesthesiologists can best time their antibiotics, and be at a top-notch outpatient surgery center, where there are fewer germs circulating than in a big hospital. I did both.

So overall, are patients like me better off than when I wrote my “Patients at Risk” series in 1999?

Logic — and the still visible YES on my thumb — say so, and I’m willing to grasp at any good news these days on the health front. So is my surgeon, Neal Chen, head of Mass General’s Hand and Arm Service. When he was a resident 20 years ago, he explained, one of his mentors urged him to “look at the papers and make sure the patient’s name is right. Make sure it’s the right arm. If you get either wrong, you’re the stupidest person in the world. … He’d figured out that [simple] systems are really crucial to delivering good care, and he was way ahead of his time with that.”

Larry Tye is an author based on Cape Cod, Mass., and runs a health journalism fellowship based at the Harvard T.H. Chan School of Public Health.


Federal watchdog urges more oversight of WIPP maintenance • Source New Mexico


Degraded infrastructure and lax federal oversight of maintenance contractors threaten the future operation of the nation’s only underground nuclear waste repository, located in southern New Mexico, according to a recent report issued by a federal watchdog.

Maintenance concerns at the Waste Isolation Pilot Plant, which lies underground in a saltbed outside of Carlsbad, come as federal officials plan to accept radioactive waste until the 2080s — more than 50 years beyond the original expectation.

The federal government contracts out daily operations and maintenance at WIPP, and has used the Salado Isolation Mining Contractors at the site since 2022.

In June, the Government Accountability Office issued its findings on WIPP maintenance and facilities, noting that more than half of the necessary equipment and infrastructure (called “mission-critical”) is reported to be in “poor and substandard conditions.”

The report notes that according to the U.S. Department of Energy — which oversees U.S. nuclear weapons programs, including disposal — having infrastructure in poor condition “increases the potential for infrastructure failure, allowing greater risk of unforeseen delay to waste disposal operations or shutdown of the site.”

Federal officials in the Carlsbad Field Office said WIPP has been in a “reactionary mode” to keep up with repairs and aging equipment since a series of 2014 accidents, which shut down operations for several years, according to the report.

WIPP had a projected lifespan of only 25 years, Don Hancock, a decades-long anti-nuclear advocate in New Mexico with the Southwest Research and Information Center, told Source NM.

“Because of the obsolescence of many things in this facility, these problems will continue to occur,” Hancock said. “This facility was never supposed to, and won’t, successfully operate for 80 to 85 years without other accidents.”

The report noted some improvements since a 2016 review, which documented more than $37 million worth of repairs that occurred behind schedule and with missed deadlines. However, new problems have since emerged.

Those problems include a shaft used to transport salt removed from underground, which underwent emergency refurbishment in 2024 due to “a high risk of failure” after salt pushed into the shaft faster than expected. The report noted that other replacements were needed, such as a salt hoist built in 1924 and installed at WIPP in 1984.

Contractors have identified more than 100 priority repairs at WIPP, extending into 2033, but data about the condition of the infrastructure remain missing or incomplete, the report found. Federal officials failed to hold contractors accountable for the lapses and did not set timelines for fixes to be in place, the report said.

The report concludes that federal officials need to enact further oversight, such as grading contractors on their long-term planning efforts in contract evaluations; ensuring data issues are addressed; and setting deadlines to implement fixes.

In a response letter, U.S. Department of Energy officials agreed to the recommendations, and said they would implement them in 2025 and 2026.

Hancock said the report highlights the need for further nuclear storage options, outside of New Mexico.

“As long as WIPP is the only repository — whether it’s safe or not, whether it’s obsolete or not, whether it’s falling apart or not, whether they’re adequately maintaining it or not — if it’s the only one, everything ultimately will be shoehorned in,” Hancock said.
