Midjourney Distilled: Self-Cannibalisation and Collapse as methods for revealing an AI model’s ‘House Style’

Martin Disley
Institute for Design Informatics, University of Edinburgh, Edinburgh, UK
m.disley@ed.ac.uk

DOI 10.34626/2024_xcoax/classof24_003

Abstract

This text concerns the relationship between state-of-the-art generative image models and homogeneity in visual culture. It introduces an experimental method for investigating the ways in which the latter might be technologically determined by the former and discusses, through a case study in visual practice, artistic strategies for disrupting this.

Keywords

Generative AI, Image synthesis, Aesthetics, Data Scarcity, Adversarial Practice.

1. Introduction

During the second half of the 20th century, the radioactivity of the earth's atmosphere changed. The nuclear arms race and associated testing of atomic and thermonuclear weapons increased the level of background radiation in the atmosphere to a degree measurable at any point on earth. As a result, any industrial process that involved atmospheric oxygen, such as the production of steel, contaminated the resultant products with radionuclides. This was a problem for materials destined for use in radiation-sensitive applications, such as Geiger counters and medical equipment. Thus, there was a sudden and sustained demand for radionuclide-free steel, also known as low-background or pre-war steel.

The widespread adoption and use of large language models and diffusion models, prompted by the release of accessible text-based interfaces such as ChatGPT and Midjourney, has fundamentally altered the nature of text and images on the internet after 2022. Much like the contamination of industrial processes with radionuclides post-1945, the proliferation of AI-generated text and images into the pool of trainable data poses an analogous problem for applications that require human-created text and images post-2022. Applications of this nature include the training of the next generation of these models. Just as after 1945 there was a demand for low-background steel, after 2022 there is a demand for analogously low-background data.1

This text introduces a novel method for investigating this phenomenon and the effects it may have on visual culture, by experimentally accelerating the effects of a data-model feedback loop in diffusion models. The method reconfigures approaches from other studies that investigate data-model feedback loops in language models and applies them to image models. In addition, the text discusses other ways these models might currently be producing homogeneity in digital image production and, therefore, in visual culture at large. The project also includes a practice-based component, discussed below: a series of new image works created using a model produced by the method introduced here. A case study of this work adds a further commentary on homogeneity in digital image production, but also complicates this narrative and points to some ways out of this loop for visual practitioners.

2. Data Contamination

Before detailing how the data-model feedback loop degrades model quality in the way that it does, it is necessary to discuss the data resource context that has turned it from a theoretical phenomenon into a concrete problem for developers, technology entrepreneurs and researchers.

Conceived of as a single collective source, there exists no larger resource of data for foundation model training than the images and text that are accessible to programmatic data collection on the open, "clearnet", portion of the internet. The largest published datasets of public text data, such as RefinedWeb (Penedo et al. 2023), C4 (Raffel et al. 2020), Dolma (Soldaini et al. 2024) and The Pile (Gao et al. 2020), contain tens of trillions of words collected from billions of web pages, and equivalent corpora exist for image data, such as the LAION-5B dataset used to train Stable Diffusion (Schuhmann et al. 2022). Although many of these datasets remain unpublished, such as those used to train OpenAI's GPT series and the Midjourney text-to-image series of models, as they are the proprietary assets of commercial ventures, we can infer that they are at least as large as their open-source counterparts, if not larger.

Under the current architectural paradigm, for models to continue to advance they must grow in both parameter count and number of training tokens. Hoffmann et al. find that for compute-optimal training, the model size and the number of training tokens must be scaled equally (Hoffmann et al. 2022); doubling the model size requires a doubling of data. However, the stock of available data is not growing as fast as models are. In a recent paper, Villalobos et al. produce a model of data scarcity based on estimates of internet data growth and model size growth, and predict that demand for training data will outstrip the stock of available human-generated data by 2032 at the very latest (Villalobos et al. 2024).

Growth in available token stocks is complicated further by changes to consent protocols specifically limiting the collection of internet data for AI development. Longpre et al. note that restrictions on crawling are being placed both legally, as stated in domain providers' terms of service, and programmatically via the Robots Exclusion Protocol (REP) expressed through a site's 'robots.txt' file. They estimate that in the last year roughly 5% of the entire corpora of C4, RefinedWeb, and Dolma have become restricted by robots.txt files. This number rises to nearly a quarter when considering only the most salient domains in these datasets, and nearly 45% of all tokens in these datasets now carry some form of restriction once domains' Terms of Service clauses are considered (Longpre et al. 2024). Rather than simply running out of web to scrape, then, consent protocols are actively shrinking the pool of unrestricted data on the web.
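
The Robots Exclusion Protocol can be queried programmatically. The sketch below, using only Python's standard library, shows how a compliant crawler is expected to check whether it may fetch a page; the domain is illustrative, and 'GPTBot' is the user-agent OpenAI publishes for its training-data crawler.

    from urllib import robotparser

    # Parse a site's robots.txt and ask whether a given crawler may fetch a page.
    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")   # illustrative domain
    rp.read()

    # 'GPTBot' is the user-agent string published by OpenAI for its data-collection crawler.
    print(rp.can_fetch("GPTBot", "https://example.com/some-article"))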

The models trained on this data are now being used to contribute to it. In terms of user base, OpenAI's ChatGPT is the fastest-growing software product ever released, and the impact of it and its competitors has been profound, but estimating how much of the text now published on the internet was produced using these tools is currently impossible. Beyond spotting the canned refusal messages returned by the software application through which a model is accessed when a request violates its safety policy, there exists no known method of reliably disentangling human-created text and images from AI-created text and images after 2022. This can be taken as evidence that there is far more synthetic text and imagery on the internet than we are able to detect. It also means that web crawlers, the software used to collate the next internet-scale training datasets, are now collecting synthetic images and text.

With neither methods for separating AI-generated content from human-generated content nor model optimisation without new data forthcoming, it is imperative to consider what happens when foundation models are trained on the output of previous generations.

3. Model Collapse

Shumailov et al. (Shumailov et al. 2023; Shumailov et al. 2024) find that incorporating generated data into the training sets of new generations leads to compounding defects in the resulting models. They find that over repeated training cycles the tails of the original training data's distribution start to disappear, and over time the distribution converges on an invariant subset. Put another way, rather than merely stalling the next generation of models' ability to produce greater diversity in outputs, feedback loops in data generation and model training actually reverse this capacity.

The cause of model collapse is straightforward. Every statistical modelling process suffers from sampling error: with any finite sample, discrepancies between the sampled distribution and the actual distribution will exist (the sample mean will differ from the true mean, for instance). Training models on data produced by previous iterations compounds these errors as they propagate through new iterations of the model. This induces a distribution shift, whereby previously distinct modes, or clusters, in the original distribution become entangled, resulting in the generated data tending towards unimodality. Over successive recursive cycles, this process results in a phenomenon they call model collapse.
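
A minimal simulation makes this mechanism concrete (an illustration, not the cited authors' code): repeatedly fitting a one-dimensional Gaussian to samples drawn from the previous fit causes the estimated spread to shrink over generations.

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma = 0.0, 1.0          # the "true" distribution behind the original data
    n = 500                       # finite sample size at every generation

    for generation in range(20):
        samples = rng.normal(mu, sigma, n)          # data "generated" by the current model
        mu, sigma = samples.mean(), samples.std()   # the next model is fit to that data
        print(f"gen {generation:2d}: mean={mu:+.3f}  std={sigma:.3f}")

    # With the maximum-likelihood estimator, the expected variance shrinks by a factor
    # of (n-1)/n each generation, so the fitted distribution drifts towards a point.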

Whilst data-model feedback loops produce a decline in variety in generated outputs relative to the outputs of previous generations of the model, there is evidence to suggest that generative models already produce a decline in variety relative to their counterfactual human-produced counterparts across media; that is to say, a decline in variety compared with what would exist had all the images generated by these models been produced by other means. Before returning to the data-model feedback loop as a method of exploring the relationship between these models and homogeneity in visual culture, it is worth detailing the processes that produce this effect.

A picture is worth a thousand tokens

As per the old adage, short text prompts are less information-rich than 1024x1024px images. The "magic" of generative text-to-image models lies in their capacity to perform cross-modal information upscaling: taking the short text prompt and producing a corresponding information-rich image. To restate what is apparent for emphasis: the model makes determinations about the composition and manifold qualities of the image on behalf of the user. This is what gives these models their enormous utility. Naturally, this information upscaling, the "generative" part of the process, requires the ceding of control of those decisions from the user to a probabilistic extrapolation tuned towards a particular and limited interpretation of natural language and a particular and limited understanding of image fidelity. The particularities of that extrapolation are what determine the qualities of the images produced. They set the bounds and direction of the upscaling, defining limits on what images can and cannot be produced, along with a general aesthetic character.

In The Nooscope Manifested, Matteo Pasquinelli and Vladan Joler (Pasquinelli and Joler 2021) note that predictive machine learning models are fundamentally unable to handle a truly unique anomaly. To the mechanised pattern recogniser - a system of learned but ultimately static representations - the anomalous is indistinguishable from noise; the heterogeneous is a pathological aberration. The same logic applies to generative applications including text-to-image models. They are unable to generate anything truly unlike that which they have been trained on, and the particularities of that training determine the particularities of the cross-modal upscaling.

I would argue that marketing material and images demonstrating the capacity of diffusion models often inadvertently demonstrate this very limitation through their heavy use of a kind of superficially surrealist aesthetic. These "surreal" demo images are just recompositions of familiar visual phenomena uncommonly composed together. They work well as demonstration images because they are both high-fidelity renderings and clearly distinguishable from photographs, thus demonstrating a capacity to produce images comparable in quality to, but improbable to produce with, lens-based image-making technologies. Pasquinelli and Joler term this phenomenon "the dictatorship of the past" (Pasquinelli and Joler 2021) arguing that it represents a hard logical limit in the capabilities of machine learning based artificial intelligence at large.

Drawing on Zuboff, we can understand the wider economic conditions that preclude optimisation towards novelty. She writes that the pervasive system of surveillance, produced by the contemporary mode of value extraction she terms "surveillance capitalism", turns knowledge of present behaviour into an authoritarian project for total certitude: an economic imperative that necessarily restricts the future to a recreation of the recorded present (Zuboff 2018). Similarly, Franco 'Bifo' Berardi sees in this a determinist trap, where what is possible necessarily becomes what is probable (Berardi 2020). He also locates this phenomenon's cultural antecedent in what he calls the "slow cancelling of the future", arguing that the disappearance of the belief that "things are going to get better" was caused by the disappearance of the cultural avant-garde and the arrival of neoliberalism (Berardi 2011, p. 13).

3.1. The model is the message

Whilst the above describes the mechanics that limit the productive capacity of a given model, the effects of this on visual culture at large are of course complicated by the existence of an ecosystem of models. Given their unique training data and differing model architectures, each application has its own particular probabilistic, and therefore aesthetic, character or bias. Since in the discourse on machine learning the term "bias" connotes something negative, or a quality that in other contexts would be avoided, the developers of the Midjourney suite of models instead speak of their model's "house style", an expression now picked up by others (Manovich 2023).

This ecosystem is divided into an oligopoly of commercial platforms, including models from companies such as Midjourney, OpenAI and Adobe, and a long tail of open-source models based on the Stable Diffusion platform. Since the method detailed here targets an individual model, it is worth discussing some of these models and the mechanisms through which their house styles are encoded.

Midjourney relies heavily on user feedback for aesthetic optimisation (Midjourney 2024a). Users express aesthetic preferences through an application feature called "pair rankings", in which they are asked to vote for whichever of a pair of images they prefer. Users are incentivised to participate through a reward mechanism in which they can earn access to higher-tier compute resources (Midjourney 2024b). This data is then used as part of a reinforcement learning from human feedback (RLHF) stage in the model's fine-tuning.
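
How such pair rankings can be turned into a training signal is typically a pairwise preference (Bradley-Terry) loss over a learned reward model. The sketch below illustrates that general recipe only; it is not Midjourney's disclosed implementation, and the tensors stand in for scores a reward model would assign to each image in a ranked pair.

    import torch
    import torch.nn.functional as F

    def pairwise_preference_loss(reward_preferred, reward_rejected):
        # Push the score of the image the user picked above the score of the image they passed over.
        return -F.logsigmoid(reward_preferred - reward_rejected).mean()

    # Stand-in scores for a batch of pair rankings (a real reward model would map image -> scalar).
    preferred = torch.randn(8, requires_grad=True)
    rejected = torch.randn(8, requires_grad=True)
    loss = pairwise_preference_loss(preferred, rejected)
    loss.backward()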

As an open-source tool rather than a platform, Stable Diffusion does not have access to its user base's preferences in the way Midjourney does. Instead, Stable Diffusion's aesthetic optimisation relies on a smaller subset of the LAION-5B dataset (Schuhmann et al. 2022). LAION Aesthetics2 was created using a series of linear classifier models trained to predict an aesthetic rating based on human preferences. These classifiers were themselves trained on data provided by click workers asked to rate, out of ten, "how much they liked the look of" an image.
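
This kind of predictor can be sketched as a small regression head trained against the click workers' one-to-ten ratings; the embedding dimension, the use of precomputed CLIP image embeddings as input and the training details below are illustrative assumptions rather than LAION's exact pipeline.

    import torch
    import torch.nn as nn

    class AestheticPredictor(nn.Module):
        # A linear head that maps a precomputed image embedding to a predicted aesthetic score.
        def __init__(self, embed_dim=768):
            super().__init__()
            self.head = nn.Linear(embed_dim, 1)

        def forward(self, embedding):
            return self.head(embedding).squeeze(-1)

    model = AestheticPredictor()
    optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Stand-in batch: precomputed image embeddings and human ratings out of ten.
    embeddings = torch.randn(32, 768)
    ratings = torch.randint(1, 11, (32,)).float()

    loss = nn.functional.mse_loss(model(embeddings), ratings)
    loss.backward()
    optimiser.step()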

The impact of these 'house styles' on visual culture is hard to measure empirically. Given these models' capacity to produce manifold compositions in myriad styles, proficient users can navigate away from the default style with prompts that reference other 'styles', but this exists on a continuum: many images generated in non-default 'styles' do not shed all elements of the house style.

Whilst the advanced user might be able to create images with a greater degree of novelty, I would argue that the house style is the definitive aesthetic of a new but increasingly pervasive kind of internet content: a particular kind of AI-generated spam that circulates on social media platforms, especially Facebook3. Coined "AI slop" by an anonymous poster on the social media platform X, whose account was later suspended, the term was cited on Simon Willison's blog (Willison 2024) and popularised through an article in the New York Times (Hoffman 2024). In its visual form, images in a range of styles have been labelled AI slop, but the dominant style might be described as "ultra-realistic game-engine rendering". This is characterised by subjects that are realistically depicted and clearly computer-generated, but neither cartoonish nor photorealistic. The subjects are often also lit with a kind of studio lighting that looks mismatched with the content of the image.

Whilst it is difficult to measure the global variation in synthetic images, several studies have looked at variation in synthetic text outputs relative to their human-produced counterparts. Doshi and Hauser (2024) study the production of "story ideas", short proto-narratives that could conceivably be worked up into larger stories, created both by humans working with language models and by humans working unaided. They found that whilst variation at the individual level was greater when the participant was using a language model, variation across the corpus as a whole decreased when the language model was used. Castro et al. study individuals' preferences for outcomes of text-based tasks completed with and without language models (Castro, Gao, and Martin 2023). Their study shows that outputs produced by users completing tasks with language models are less unique than those the users would have produced unaided. They confirm this at the population level, where the AI-generated output distribution has a lower variance than the users' preference distribution.

Whilst we can infer that what is true for a generative language model should hold true for a generative image model, there are variables that would make an equivalent study difficult to pursue. Unlike with an LLM, a single-modality model whose users, being able to write inputs, could also in principle write counterfactual outputs, most users of generative image models were not previously proficient image-makers by other means and therefore lack the capacity to produce counterfactual images against which the model's outputs could be compared.

4. Practical Implementation

As discussed above, Shumailov et al. investigate data-model feedback loops in the context of Gaussian Mixture Models, Variational Autoencoders and large language models (Shumailov et al. 2023; Shumailov et al. 2024). The experiments that led to the application of this method to the research questions outlined above emerged from attempts to repeat these studies on diffusion models.

Given the resources required to train these enormous foundation models, this method does not propose to restart the process of training. Instead, it proposes to 'fine-tune' the model using a technique employed by many open-source projects called Low-Rank Adaptation, or LoRA (Hu et al. 2021). Rather than updating the weights of the model directly, a LoRA consists of additional low-rank weight matrices that are trained separately and combined with the original layers of the model when running inference. Through simple arithmetic operations, this "tunes" the weights of the model towards a different set of vectors (using a loss calculated against the new training data) instead of manipulating them directly. This results in a much more memory-efficient "plug-in" model that can be used in conjunction with the original model rather than replacing it. By iteratively training LoRA models, each on a dataset of images produced using the previous LoRA, we can simulate model collapse without retraining the base model.
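
As a minimal sketch of the underlying arithmetic (an illustration with made-up dimensions, not the project's implementation), a LoRA leaves a frozen weight matrix W untouched and adds a scaled low-rank product B·A on top of it at inference time:

    import numpy as np

    d, r, alpha = 768, 8, 16           # hidden size, LoRA rank, scaling factor (illustrative values)
    W = np.random.randn(d, d)          # frozen base weight matrix
    A = np.random.randn(r, d) * 0.01   # trained low-rank factor
    B = np.zeros((d, r))               # B starts at zero, so the initial update is a no-op

    def effective_weight(W, A, B, alpha, r):
        # The base weights stay untouched; the LoRA update is simply added on top.
        return W + (alpha / r) * (B @ A)

    x = np.random.randn(d)
    y = effective_weight(W, A, B, alpha, r) @ x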

Due to its cultural significance, as arguably the most popular text-to-image tool, Midjourney was selected as the initial target for this project. As a private and proprietary model, the kind of manipulation undertaken in this project is not possible with the version of Midjourney accessed through its Discord server. Instead, a version of the open-source diffusion model Stable Diffusion that has been retrained to mimic the style of Midjourney was used. Specifically, the Midjourney Mimic LoRA (AndrexSel 2024), created by a pseudonymous user of the generative AI image-synthesis community and platform CivitAI, was merged into the Stable Diffusion XL base model to produce a new base checkpoint.
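
A merge of this kind can be sketched with the Hugging Face diffusers library; the LoRA filename, fusion scale and output directory below are illustrative assumptions rather than the exact artefacts used in the project.

    import torch
    from diffusers import StableDiffusionXLPipeline

    # Load the public SDXL base checkpoint.
    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
    ).to("cuda")

    # Load the style LoRA downloaded from CivitAI and fuse it into the base weights,
    # producing a new checkpoint that bakes in the target 'house style'.
    pipe.load_lora_weights("midjourney_mimic_v1.2.safetensors")  # illustrative filename
    pipe.fuse_lora(lora_scale=0.8)
    pipe.save_pretrained("sdxl-midjourney-mimic-base")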

The images generated with each LoRA to seed the next round are produced under the same conditions in every cycle. Initial experiments with and without prompts made clear that there are marked differences in the kinds of images produced and in the effect this has on the process. During an exploratory experiment, text prompts were not used in the production of the images for each dataset (i.e. the diffusion process was not "guided" by a text prompt). Working under the assumption that the success of this method lay in introducing the least amount of additional external direction or input into the process, this approach was adopted in order not to bias the model towards subjects and compositions supplied arbitrarily by the researcher. However, it is clear that the images produced without text prompts have a particular composition that is not representative of the kinds of images normally produced using these tools. Without being directed towards rendering a particular subject, the images generated without text prompts tend towards patterned or chaotic compositions. Many feature dystopian scenes of abandoned, decaying urban environments or textural compositions of overgrown woodland or rainforest. Examples of the output of these models can be seen in Figure 1.

Figure 1: Composite of outputs from early Model Distillation studies

In the second attempt, prompts were used to produce the training images. The set of prompts used here was a subset of the JourneyDB dataset, a collection of 4 million prompt, image and caption samples (Sun et al. 2023). The text prompts curated in JourneyDB were scraped directly from the publicly accessible Midjourney Discord server, a popular interface for interacting with the model. For this experiment, a filtered subset of 2,000 unique text prompts was created, as sketched below. Multiple factors informed the size of this subset: whilst a larger dataset would likely have produced more accurate results, the project was limited by access to computational resources. These prompts were then used to generate the images produced at each round.
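
As an illustrative sketch (assuming the JourneyDB prompts have already been exported to a local text file, one prompt per line, which is an assumption of this example rather than a detail of the project), the fixed subset reused at every round might be prepared as follows:

    import random

    # Load the scraped prompts, deduplicate them, and sample a fixed subset for every training round.
    with open("journeydb_prompts.txt", encoding="utf-8") as f:
        prompts = [line.strip() for line in f if line.strip()]

    unique_prompts = sorted(set(prompts))
    random.seed(42)                      # keep the same subset across every cycle
    subset = random.sample(unique_prompts, 2000)

    with open("prompt_subset.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(subset))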

Figure 2: Recursive training flowchart

To train the LoRA models, the training software Kohya (bmaltais 2024) was used. Kohya is a graphical interface for the LoRA training scripts, allowing users to parameterise the training process in a more user-friendly manner before running it. Prior to training, a captioning algorithm was run on the training images to provide a second modality for image understanding. Before each training cycle, the latest LoRA model is used in conjunction with the base model to generate another training set of images using the same set of text prompts. These images become the training images for the next cycle, and the process repeats. At each stage, the learning rate of the LoRA training was lowered to prevent the model from collapsing too early.
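
The outer recursive loop can be sketched as follows. This is a schematic under stated assumptions: the merged checkpoint name, file paths, cycle count, learning-rate schedule and the train_lora placeholder (which stands in for the Kohya training run configured through its interface) are all illustrative, not the project's exact configuration.

    import os
    import torch
    from diffusers import StableDiffusionXLPipeline

    def generate_dataset(lora_path, prompts, out_dir):
        # Generate one image per prompt, with the current LoRA applied to the merged base checkpoint.
        os.makedirs(out_dir, exist_ok=True)
        pipe = StableDiffusionXLPipeline.from_pretrained(
            "sdxl-midjourney-mimic-base", torch_dtype=torch.float16
        ).to("cuda")
        if lora_path is not None:
            pipe.load_lora_weights(lora_path)
        for i, prompt in enumerate(prompts):
            pipe(prompt).images[0].save(f"{out_dir}/{i:05d}.png")

    def train_lora(dataset_dir, output_path, learning_rate):
        # Placeholder for the Kohya training run over the captioned dataset;
        # in the project this step was configured through the Kohya interface.
        raise NotImplementedError

    prompts = open("prompt_subset.txt", encoding="utf-8").read().splitlines()
    lora_path, learning_rate = None, 1e-4        # illustrative starting values

    for cycle in range(8):                       # illustrative number of recursive cycles
        dataset_dir = f"cycle_{cycle:02d}_images"
        generate_dataset(lora_path, prompts, dataset_dir)
        lora_path = f"cycle_{cycle:02d}_lora.safetensors"
        train_lora(dataset_dir, lora_path, learning_rate)
        learning_rate *= 0.5                     # lower the learning rate each cycle, as described above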

Figure 3: Images generated using prompts from the Midjourney DB dataset at each step in the model's recursive training

Due to the change in modality, the statistical methods used to validate the hypotheses of the studies by Shumailov et al. did not port to diffusion models. However, it was visually apparent that a similar effect was induced. From the results of the second full experiment targeting Midjourney, we can see that the colour palette and textural patina decrease in variation with successive rounds of recursive training. If the effects we see here are caused by the same mechanisms as in Shumailov's study, we can say that the images generated using a given prompt shift within the model's latent distribution towards a central mean. It is my argument that this subset, clustered around a median point in the model's narrowed distribution, is an amplified representation of the target model's 'house style'.

Model Distillation

In addition to exploring the “curse of recursion” in synthetic image generators, I am also more broadly interested in the use of feedback and amplification as strategies in artistic practice that reveal how systems mediate signals.

In this way, the method outlined above reimagines the mechanism of model collapse as a process of refinement, or distillation. Conceived as such, the process can be likened to that of Alvin Lucier's 1969 composition and sound art piece, "I Am Sitting in a Room"4, a canonical example of process music that explores the relationship between sound, space, and signal processing. In the piece, the performer is instructed to read a text into a microphone and record it. The recording is then played back into the space and re-recorded repeatedly, gradually transforming the sound from speech into the pure resonant frequencies characterising the architecture and material qualities of the performance space.

In a piece titled "VIDEO ROOM 1000" (ontologist 2010), composer and video artist Patrick Liddell paid homage to Lucier's work by applying his technique to the algorithms of the then newly transformative media platform YouTube. To make it, Liddell uploaded a video of himself speaking a text similar to Lucier's original to the site, and then proceeded to manually download and re-upload it 1,000 times in sequence over the course of a year. Here, rather than using feedback to explore the acoustic character of a space, Liddell explores the character of the video compression algorithm used by YouTube.

This technique is also incidentally deployed in the creation of Deep Fried Memes, images that demonstrate their own virality through their aesthetic and formal properties. Deep Fried Memes are viral images that have been saturated, sharpened and distorted through subjection to repeated lossy mediations such as screen capture and image compression. Though there are now tools that facilitate the deliberate "frying" of images, like pre-distressing in apparel or furniture, the formal qualities of these grainy, washed-out images originally gained aesthetic value as testament to the image's viral success. That is to say, the more "fried" an image looked, the more it evidenced its own popularity. Some extreme examples even move beyond representation of the original image's content entirely, instead becoming representations of algorithmic image processing alone. In this way, they become renderings of the infrastructure of the internet, like a frottage of the internet's plumbing.

Figure 4: Deep Fried Meme Examples, Deep Fried Memes. (2017, February 16). Know Your Meme. https://knowyourmeme.com/memes/deep-fried-memes

By superimposing a technologically mediated signal back on itself again and again, we are able to magnify the particular character that mediation has upon the signal; drawing out its artefacts, amplifying the noise or the particular assemblages and phenomena to which it is biased. The methods described above offer a similar strategy for exploring diffusion models.

4.1. Visual Practice as Critique

Whilst the technical and methodological contributions discussed so far are aimed at instrumental ends, such as more informed engagement with the house styles of these models, the outputs of the practice-based component of the project, where I have used the collapsed model to produce a series of new image works, stand as a more critical intervention.

The political theory of 'left-accelerationism' (Srnicek and Williams 2015; Mackay and Avanessian 2014) can be quite instructive in this regard. Its proponents argue that strategically hastening the development of certain technologies, such as AI, will expose and exacerbate the contradictions inherent in the capitalist mode of production identified by Marx, and help to bring about its eventual collapse. Porting these ideas to this project, I propose that by accelerating the effects of the data-model feedback loop, a condition that would eventually produce homogeneity by technical means, the images produced with the collapsed model aim to thwart that aesthetic stagnation by exposing the most exaggerated manifestation of its formal qualities.

In this way the formal qualities of the images can also be understood as a kind of speculative design, as if the images were drawn using a future sclerotic neural palette, ossified by the autophagous response of the model to its own data. By revealing qualities of images produced by Midjourney that are currently only subtly present, the images seek to tune the viewer's consciousness towards spotting them, hastening their demise and quickening the public's appetite for something different.

5. Conclusion

This text details a practice-based critical engineering project called Midjourney Distilled. It introduces an experimental technical method that supports investigations into the effects of data-model feedback loops in generative computer vision systems and offers a commentary on a practice-based exploration of the resultant models as a creative tool.

The method involves training a target text-to-image diffusion model on synthetic images produced by previous generations of the model, experimentally accelerating the process of data-model feedback, bringing the target model near collapse. The implementation of this method in the study detailed above reproduces, in a qualitative fashion, the results of similar research that investigates data-model feedback loops in language models. It supports their findings that training on recursive data results in successive models tending towards unimodality, represented in this study as the production of aesthetic homogeneity in image outputs.

Furthermore, the method produces images that contribute to more informed use of the target model by creative practitioners. By collapsing the model around the centre of its initial distribution, it reveals the particular formal qualities towards which the target model is biased: its house style. Armed with a clearer sense of what these qualities are, the creative practitioner has a keener eye for steering the model away from this house style.

The new digital image works produced using the method should be read as provocations, offering a commentary on how data-model feedback loops might contribute to an arguably already apparent aesthetic homogenisation. However, this argument is complicated by the experience of working with the model as a practitioner. In attempting to produce a commentary on likely homogenising effects of synthetic image generators on aesthetics, I was ironically able to make images that I, as a practitioner, found to be novel and interesting.

This demonstrates that whilst the logical limitations of machine learning at large, as noted by Pasquinelli and Joler, set a hard limit on the production of "the new", there is still potential for aesthetic value in these models. Despite advances towards automated, generalised image production (beyond subject-specific GANs), it is still the creative process of curating and training bespoke models that produces new image works of novel aesthetic value. This leaves open questions about how the open-source community can flourish safely, but it is clear that the only way to avoid the homogenising effects of both house styles and the data-model feedback loop is to grow this ecosystem.

References

AndrexSel. 2024. ‘Midjourney Mimic - v1.2 | Stable Diffusion LoRA | Civitai’. https://civitai.com/models/251417/midjourney-mimic.

Berardi, Franco. 2011. After the Future. Edinburgh: AK Press. https://books.google.co.uk/books?id=ATA9AQAAQBAJ.

Berardi, Franco. 2020. ‘Simulated Replicants Forever: Big Data, Engendered Determinism, and the End of Prophecy’. In Big Data: A New Medium?, edited by N. Lushetich. Taylor & Francis. https://books.google.co.uk/books?id=vhAHEAAAQBAJ.

bmaltais. 2024. ‘Bmaltais/Kohya_ss’. Python. https://github.com/bmaltais/kohya_ss.

Castro, Francisco, Jian Gao, and Sébastien Martin. 2023. ‘Human-AI Interactions and Societal Pitfalls’. arXiv. https://doi.org/10.48550/arXiv.2309.10448.

‘Deep Fried Memes’. 2017. Know Your Meme. 16 February 2017. https://knowyourmeme.com/memes/deep-fried-memes.

Doshi, Anil R., and Oliver P. Hauser. 2024. ‘Generative Artificial Intelligence Enhances Creativity but Reduces the Diversity of Novel Content’. arXiv. https://doi.org/10.48550/arXiv.2312.00506.

Gao, Leo, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, et al. 2020. ‘The Pile: An 800GB Dataset of Diverse Text for Language Modeling’. arXiv. https://doi.org/10.48550/arXiv.2101.00027.

Goetze, Trystan S. 2024. ‘AI Art Is Theft: Labour, Extraction, and Exploitation, Or, On the Dangers of Stochastic Pollocks’. arXiv. https://doi.org/10.48550/arXiv.2401.06178.

Gunasekar, Suriya, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, et al. 2023. ‘Textbooks Are All You Need’. arXiv. https://doi.org/10.48550/arXiv.2306.11644.

Hoffman, Benjamin. 2024. ‘Is Slop A.I.’s Answer to Spam?’ New York Times, 11 June 2024. https://archive.is/qtjBD.

Hoffmann, Jordan, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, et al. 2022. ‘Training Compute-Optimal Large Language Models’. arXiv. https://doi.org/10.48550/arXiv.2203.15556.

Hu, Edward J., Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. ‘Lora: Low-Rank Adaptation of Large Language Models’. arXiv Preprint arXiv:2106.09685.

jeff [@jeffreyhuber]. 2023. ‘Gotta Love Those Pre-2023 Low-Background Tokens Https://T.Co/3NlyNd3luA’. Tweet. Twitter. https://twitter.com/jeffreyhuber/status/1732069197847687658.

Longpre, Shayne, Robert Mahari, Ariel Lee, Campbell Lund, Hamidah Oderinwale, William Brannon, Nayan Saxena, et al. 2024. ‘Consent in Crisis: The Rapid Decline of the AI Data Commons’. arXiv. http://arxiv.org/abs/2407.14933.

Mackay, Robin, and Armen Avanessian. 2014. # Accelerate: The Accelerationist Reader. MIT Press.

Manovich, Lev. 2023. ‘Chapter 5: Seven Arguments about AI Images and Generative Media’. In Artificial Aesthetics: A Critical Guide to Artificial Intelligence, Media and Design. https://www.academia.edu/101256302/Seven_Arguments_about_AI_Images_and_Generative_Media_Chapter_5_of_Artificial_Aesthetics_.

Midjourney. 2024a. ‘Earn Free Hours Ranking Images on Midjourney’. 2024. https://docs.midjourney.com/docs/free-hours.

———. 2024b. ‘Personalization (--p)’. Midjourney (blog). 12 June 2024. https://updates.midjourney.com/personalization/.

ontologist, dir. 2010. VIDEO ROOM 1000 COMPLETE MIX -- All 1000 Videos Seen in Sequential Order! https://www.youtube.com/watch?v=icruGcSsPp0.

Park, Ji-Hoon, Yeong-Joon Ju, and Seong-Whan Lee. 2024. ‘Explaining Generative Diffusion Models via Visual Analysis for Interpretable Decision-Making Process’. Expert Systems with Applications 248 (August):123231. https://doi.org/10.1016/j.eswa.2024.123231.

Pasquinelli, Matteo, and Vladan Joler. 2021. ‘The Nooscope Manifested: AI as Instrument of Knowledge Extractivism’. AI & SOCIETY 36 (4): 1263–80. https://doi.org/10.1007/s00146-020-01097-6.

Penedo, Guilherme, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. 2023. ‘The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only’. arXiv. https://doi.org/10.48550/arXiv.2306.01116.

Raffel, Colin, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. ‘Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer’. Journal of Machine Learning Research 21 (140): 1–67.

Schuhmann, Christoph, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, et al. 2022. ‘LAION-5B: An Open Large-Scale Dataset for Training next Generation Image-Text Models’. arXiv. https://doi.org/10.48550/arXiv.2210.08402.

Shumailov, Ilia, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross Anderson. 2023. ‘The Curse of Recursion: Training on Generated Data Makes Models Forget’. arXiv. https://doi.org/10.48550/arXiv.2305.17493.

Shumailov, Ilia, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson, and Yarin Gal. 2024. ‘AI Models Collapse When Trained on Recursively Generated Data’. Nature 631 (8022): 755–59. https://doi.org/10.1038/s41586-024-07566-y.

Soldaini, Luca, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, et al. 2024. ‘Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research’. arXiv. https://doi.org/10.48550/arXiv.2402.00159.

Srnicek, Nick, and Alex Williams. 2015. Inventing the Future: Postcapitalism and a World Without Work. Verso Books.

Sun, Keqiang, Junting Pan, Yuying Ge, Hao Li, Haodong Duan, Xiaoshi Wu, Renrui Zhang, et al. 2023. ‘JourneyDB: A Benchmark for Generative Image Understanding’. arXiv. https://doi.org/10.48550/arXiv.2307.00716.

Villalobos, Pablo, Anson Ho, Jaime Sevilla, Tamay Besiroglu, Lennart Heim, and Marius Hobbhahn. 2024. ‘Will We Run out of Data? Limits of LLM Scaling Based on Human-Generated Data’. arXiv. https://doi.org/10.48550/arXiv.2211.04325.

Willison, Simon. 2024. ‘Slop Is the New Name for Unwanted AI-Generated Content’. 8 May 2024. https://simonwillison.net/2024/May/8/slop/.

Zuboff, Shoshana. 2018. The Age of Surveillance Capitalism: The Fight for a Human Future at the New Frontier of Power. Profile Books.


  1. https://twitter.com/jeffreyhuber/status/1732069197847687658

  2. https://laion.ai/blog/laion-aesthetics/

  3. For an archive of examples see: https://x.com/facebookaislop?lang=en

  4. https://en.wikipedia.org/wiki/I_Am_Sitting_in_a_Room

