Non-Fiction Reviews


Big data, data torturing and the assault on science

(2023) Gary Smith, Oxford University Press, £25, hrdbk, 323pp, ISBN 978-0-192-86845-9


There have always been folk who distrusted science and believed in woo-woo and other nonsense. For example, over the years John Grant has taken a number of looks at these with: Bogus Science or, Some People Really Believe These Things; Corrupted Science: Fraud, Ideology and Politics in Science; and Denying Science: Conspiracy Theories, Media Distortions and the War Against Reality.  All are solid works. But the stream of faux-science, fake news and the like is never-ending. The latest counter to all of this comes from the economist Gary Smith.

In Distrust, economist Gary Smith spends a little time in this fake news and urban myth territory (which is very interesting) but the main thrust, as revealed by the book's sub-title, is that scientists have brought this distrust on themselves through highly flawed analyses that abuse statistical methodology.

Early on, Smith looks at telepathy experiments using Zener card guessing, in which the experimenter looks at a card and the subject has to try to read the experimenter's mind. Here, the botanist J. B. Rhine (clearly operating outside his area of expertise) claimed to have statistically proven that genuine telepathic predictions were being made. However, a look at his work reveals extremely shoddy statistical practice. For example, if the experimenter reveals each card after the subject has tried to guess it – and given there are only five symbols (circle, cross, wavy lines, square and star) – then a subject who has just seen, say, two wavy-lines cards can reasonably guess that the next card is unlikely to be wavy lines, raising the chance of a correct guess from one in five towards one in four. Also, Rhine apparently conducted thousands of experiments but only reported on those that seemed to show telepathic communication between the experimenter seeing the cards and the subject. Had all the other experiments been taken into account, these apparently successful ones would be just what one would expect from sheer chance.
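The card-counting effect is easy to demonstrate. The sketch below is my own illustration, not from the book: it simulates dealing a standard 25-card Zener deck (five of each symbol) many times, comparing a purely random guesser against a guesser who tracks each revealed card and always names a symbol with the most copies still unseen:

```python
import random
from collections import Counter

SYMBOLS = ["circle", "cross", "wavy", "square", "star"]

def run_trials(n_decks=10_000, seed=1):
    """Return (naive_hit_rate, card_counting_hit_rate) over many deals."""
    rng = random.Random(seed)
    naive_hits = counting_hits = total = 0
    for _ in range(n_decks):
        deck = SYMBOLS * 5              # standard 25-card Zener deck
        rng.shuffle(deck)
        remaining = Counter(deck)       # counts of cards not yet revealed
        for card in deck:
            # naive subject: guesses a symbol uniformly at random
            if rng.choice(SYMBOLS) == card:
                naive_hits += 1
            # card-counting subject: guesses a symbol with most copies left
            if max(remaining, key=remaining.get) == card:
                counting_hits += 1
            remaining[card] -= 1        # card is revealed after the guess
            total += 1
    return naive_hits / total, counting_hits / total
```

The random guesser hovers at the chance rate of one in five, while the card-counter does markedly better with no telepathy whatsoever – exactly the loophole Rhine's protocol left open.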

Gary Smith does take us on some interesting detours, such as the BBC spoof documentary Ghostwatch (1992), which only very briefly hinted that it was in fact a Halloween spoof: the BBC got its knuckles rapped by the then broadcasting ombudsman (the Broadcasting Standards Commission) over this. He also spends some time on flying saucers, crop circles and claims that the Apollo Moon landings were faked, providing the rational answers most of us in the SF and science community know.

Then there are the theories that the world is being run by a small group, an elite of some sort. It is frightening – if the surveys and polls are to be believed – that in 2011 nearly a fifth of US citizens believed that the 9/11 attack on the World Trade Center in 2001 was a government plot to provide an excuse to go to war in the Middle East.

One such belief is that a wealthy Jewish financier, George Soros, was behind the British pound's crisis of 1992. Actually, as Gary Smith explains, Soros simply used common sense: the pound was over-valued when it joined the European Exchange Rate Mechanism (ERM) in 1990, so he borrowed a load of money to bet against it and, when the bet paid off, repaid the borrowed cash and kept a hefty profit. No conspiracy was needed.

To return to this book's sub-title, Gary Smith then turns to the creation of big data. Here I am totally with him: with big data we are creating monsters. For example, as he points out, we can be pretty confident that anything free that, say, Google provides is being used to gather data to fuel its advertising revenue. This is why it likes gathering data on its advertisers' potential customers – you. It would theoretically be possible for Google to scan every email you send or receive with Gmail, note the websites and pages you have visited with Google Chrome, every word you write with Google Docs, every number you crunch with Google Sheets, and so on. Overall, as Gary Smith points out, in 2022 Google provided nearly 300 products, with an estimated 1.8 billion people using Gmail (e-mail) and more than 3 billion using the Google Chrome web browser. Google has purchased MasterCard transaction data to link advertising to sales and also tracks sales receipts to Gmail accounts. This is not a secret: if you have a Google account you can go to it and check. (Of course, if you don't have a Google account you will not know what they know about you, and even if you do have an account you can bet that you will not get to learn everything they hold on you.)

As said, I have long had concerns about big data and I am totally gobsmacked by the way very many science fiction fans (including those who are scientists and should know better) are sleepwalking into big data – happy to use the cloud and services such as EventBrite, and to buy into the Amazon dream (despite worrying aspects about the company), and so forth. For goodness sake, haven't SF fans read John Brunner, George Orwell, William Gibson et al.?

The thing is, as Gary Smith points out, that this monster of big data is something we scientists and engineers have created. Just as the internet can be used for good – to access education and other worthy goals – it can also be used to spread and reinforce conspiracy theories, frauds and political ideologies. What happened with Hitler in the early 20th century can happen far, far faster, and more thoroughly, with Trump, Putin, Brexit etc., in the 21st.

As Distrust progresses, so it slowly homes in on its central conceit: that we in the science community are ourselves actively sowing the seeds of public distrust through bad science, abused statistics and tortured data.

Gary Smith makes a good case and I am with him in terms of his sentiments. He also makes some sensible suggestions at the book's end as to what to do about it all. But, sadly, I am not sure that the people with access to the levers needed to implement these will pull them. Nonetheless, that is no excuse for not getting the message out there.

Further, while I agree with Gary Smith on much, reading the book does give the impression that there is an awful lot of p-hacking and data torturing going on. True, there is some, and true, it is deplorable, but when I read my weekly copies of Nature and Science, most of the papers do not rely on statistically analysed quantitative data, and my feeling is that the majority of those that do are not engaged in data torturing. Conversely, in climate change science we have had denialists dispute warming data, and the 2009 so-called 'climategate' affair (which made exactly the same point I did in my 1998 textbook). So with us (in the environmental, geo- and bioscience community) it has been members of the public abusing data…
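The selective-reporting problem the book describes – and which caught out Rhine – is easy to see with a quick simulation (my sketch, not the author's): run thousands of 'studies' on pure noise, test each at the conventional 5% significance level, and a steady trickle of them will come out 'significant' anyway. Report only those, and you have manufactured a finding from nothing:

```python
import math
import random

def z_test_p(heads, n=100, p0=0.5):
    """Two-sided normal-approximation p-value for a binomial count."""
    z = (heads - n * p0) / math.sqrt(n * p0 * (1 - p0))
    return math.erfc(abs(z) / math.sqrt(2))   # equals 2 * (1 - Phi(|z|))

def fraction_significant(n_studies=10_000, flips=100, alpha=0.05, seed=7):
    """Fraction of pure-noise 'studies' that pass a p < alpha test."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # each 'study' is just 100 fair coin flips: there is nothing to find
        heads = sum(rng.random() < 0.5 for _ in range(flips))
        if z_test_p(heads, flips) < alpha:
            hits += 1
    return hits / n_studies
```

Roughly one study in twenty clears the bar despite the data being pure chance; file-drawer the other nineteen and the survivor looks like a discovery.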

Having said that, there is one area of science in which I fear serious misuse of statistics, if not the outright fraudulent creation of raw data, and that is clinical research. For example, a recent comment piece in Nature (van Noorden, R. (2023) How many clinical trials can't be trusted? Nature, vol. 619, p455–458) looked at a 2021 analysis of clinical trial papers (Carlisle, J. B. (2021) False individual patient data and zombie randomised controlled trials submitted to Anaesthesia. Anaesthesia, vol. 76, p472–479) which showed that one-quarter of a subset of manuscripts describing randomised clinical trials submitted to the journal Anaesthesia between 2017 and 2020 seemed to be faked or fatally flawed when their raw data could be examined. In this area of science, Gary Smith has a real point. On a personal basis, I never accept a drug from my doctor that is currently being trialled unless the research is conducted by a UK Russell Group team funded by the Medical Research Council and/or the Wellcome Trust: I prefer my pharmaceuticals to have passed trials at least half a decade ago so that any bugs in the trials and subsequent treatment have been identified. (You may think me picky, but then how many of you have had a career in science communication and bioscience policy?)

So, where does all this leave us and the book Distrust: Big data, data torturing and the assault on science?

Well, for the most part the book is a fairly easy read. It also has physically high production values, with full-colour illustrations throughout and not just ones confined to a bound-in, centrefold insert. Its message is also one that I believe needs to be promulgated (even if it is perhaps a little over-egged). Indeed, the author has been published a number of times before, including by Oxford University Press (OUP), such as with The AI Delusion, a title that itself touches on a number of this book's themes… Given all this, I am reluctant to knock this title. Yet sadly I feel that this book would have greatly (hugely, ginormously, tremendously, vastly) benefited from being treated as an early draft: it certainly should not have been published in its current state. (I am guessing that, because of the author's past successes with OUP, they were keen to publish, and the author keen to get published and move on to his next venture.)

Now, I had a few minor niggles. For example, I thought the author a tad harsh on the Christmas edition of the British Medical Journal (BMJ), which is meant to carry spoof papers as it is a deliberately jokey edition. Having said that, I am aware that such jokes can be taken seriously when they certainly should not be. (I myself have had direct experience of this, having co-authored – under a pseudonym, so no use looking – a spoof article on a device to reduce bovine emissions of high global warming potential greenhouse gases through combustion, in the Journal of Biological Education in the 1990s. We got dozens of reprint requests from readers who apparently did not see the joke.) But minor niggles aside, I had more serious concerns.

First up, on a first read, though the book is well written in terms of readability, at times – though entertained – I could not see why the author was saying what he was saying with regard to the book's central conceit. Why, for example, was there a discussion of crop circles (as interesting as that was) given there was no data to torture? Having read the book, and then quickly scan-read it again prior to writing this review, it dawned upon me that the author was saying that we humans are prone to believe in nonsense and that this stems from bad science. This is arguably a bit of a stretch but, given it, later in the book he takes the idea further: many folk who believe in nonsense will be taken in when that nonsense is given a (fake) veneer of seemingly solid scientific validity. This is one of the key points underpinning his argument, and the connections between such points need to be spelt out: readers should not be left to work out how to join the dots. With a first draft of a new book it is always wise (well, I find it so at least) to walk away (leave it completely alone for a couple of months) and then come back to it fresh and see if it all hangs together.

In addition to properly joining the dots and progressing the book's case, such a re-read also helps provide balance. Again, as entertaining as it was (and it was entertaining), perhaps a little too much time was spent on cryptocurrencies, for which the author seems to have considerable disdain, at the expense of examples not included, the further fleshing out of those that were, or even other key aspects of the issue.

One instance of this last would have been: why do people believe in nonsense, and hence are willing to believe fake science that seems to support their beliefs? (Instead, the author focuses only on the internet as a means to spread such nonsense.) He does not mention, for instance, that we humans are hard-wired with a bias between what are called type-1 and type-2 errors: believing something is there when it is not (a false positive) and missing something that really is there (a false negative). An example of type-1/type-2 bias is that in the free, developed world we presume someone innocent until proven guilty: we do not want to see innocent people punished for crimes they did not commit, and on balance we prioritise that slightly above the guilty getting punished. In nature, we became hard-wired when sensing whether or not there is a predator lurking out there in the long grass: those who react with unthinking speed survive, even if in many instances there is no predator there, while those who do not are quite likely to become lunch. (To be fair, he does refer in a couple of places to 'false positives', and these are half of the type-1/type-2 error concern.)

Further, people tend to adopt the beliefs of their peers as that binds them together: a group bound together – even by irrational belief (such as fundamentalist religion) – has a far better survival chance than individuals going it alone… (One might argue that such belief problems began when societies started exceeding Dunbar's number and needed to bind different small groups into a state or nation.)  I could go on, but you get my drift: evolution has made us this way.

I am not sure why others – the book's referees – did not pick up on some of these points. I hope it was not because the referees were cowed by the author's eminence in economics academia. (I am of the firm belief that referees should be blind to authors, and vice versa, until the refereeing process is complete. I understand that this is the case for some leading journals – such as Nature? The publisher might seriously want to look into this when they next review their refereeing process. Referees will wail at this because, not knowing how eminent or not an author is, it forces them to properly referee the work and not the author: senior academics are capable of writing crap books and junior scientists are capable of brilliance.)

Then again, I am not sure who Distrust is aimed at. As said, it is written in a very readable style, so one might think it is aimed at members of the general public concerned about issues such as distrust in vaccines and the spread of fake CoVID cures, both of which the book covers. But then we get stuck into things like statistical significance, yet what statistical significance actually is, is not properly explained for the lay person (let alone, I would argue, for scientists who hate statistics), and the space given to the explanation is far too short. So, as it stands, this book really is mainly useful to scientists who use statistics in their work and who are already familiar with this dimension of the scientist's tool kit.

Ditto the explanation of Benford's Law (though even its Wikipedia page is as bad, so the author is not alone)… I could go on.
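For what it is worth, Benford's Law is simple to state: in many naturally occurring datasets the leading digit d appears with probability log10(1 + 1/d), so around 30% of values start with a 1, and fabricated figures invented 'uniformly' tend to stand out against it – which is why it is used to screen for fraud. A minimal sketch (mine, not the book's), using powers of 2 as a classic Benford-conforming sequence:

```python
import math
from collections import Counter

def benford_expected():
    """Benford's Law: P(first digit = d) = log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit_freq(numbers):
    """Observed relative frequency of each leading digit 1..9."""
    digits = [int(str(abs(n))[0]) for n in numbers if n != 0]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}

# the powers of 2 are a well-known Benford-conforming sequence
powers = [2 ** k for k in range(1, 1001)]
observed = first_digit_freq(powers)
```

Compare `observed` with `benford_expected()` and the match is striking: roughly 30% of the powers of 2 lead with a 1, tailing off to under 5% leading with a 9.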

Finally – and this really is unforgivable, especially for a book on scientific rigour – the referencing is effectively non-existent!  Come on!

True, at the book's end we do get a list of works in a section called 'References', but they do not actually 'refer' to anywhere: there are no numbers in the text linked to the references, no lead author and year in brackets in the text relating to the alphabetical-by-lead-author list at the back.  Now, I usually do follow up a few references in books and papers that interest me, and I had wanted to do so with Distrust, but the closest I could come would be to wade through a dozen pages of (non-referring) 'references', guessing from papers' titles which might be the one relevant to the part of the book's text in which I was interested…

Now, there are many books on bad and even fraudulent science, fake news and crackpot theories: it has been an ongoing issue for centuries. (A fun recent one is Conspiracy: A History of Bxllocks Theories and How Not to Fall For Them, 2022.)  The issue of data torturing and the misuse of significance has also been bubbling along for decades. (I remember when I was at the British Medical Association in the 1980s, when the BMJ first made it a firm editorial requirement that submissions analysing data report statistical significance: there was quite a debate.) So we do need a continual stream of books addressing these issues. Distrust could so easily have been one of these, and a good one too, if only OUP had asked the author for more focus, to join the dots (mainly just a few sentences in each chapter's final paragraphs), and most certainly to sort out the references.  A great, great shame.  This could so easily have been a wonderful book.  As it is, it needs a good polish (and the referencing sorted out); had it had one, it would have been a gem I for sure would have liked to see.

Nonetheless, despite all of the above, it is whole-heartedly recommended for scientists who work analysing large data sets… though it could have been of use to so many more.

Jonathan Cowie

