In recent years I have returned to painting on canvas. It sharpens the eye and extends focus beyond my daily analytical work in IT. There is also an emotional aspect: there is some truth to how Winston Churchill famously used painting as a therapeutic escape from his "Black Dog".

Furthermore, artistic creativity and mathematical thinking seem to be two sides of the divergent-thinking coin.

Having shared my thoughts on AI Sovereignty, let's move through the rewarding pursuits of philosophy and artistic painting before landing on the raw reality of technical algorithms and a concrete application in the later sections of this text.

Chasing human insight

Let's jump into the ever-present question: what is human insight? What constitutes human spark, judgment, and creativity? While the direct answer might not appear here, let's use these questions as an excuse to sketch a path toward finding out, using some of the new tools that were hardly available just a couple of years ago.

Would it even be feasible to search, with mechanistic methods, for a human spark that some believe to be localized somewhere external to the known mechanistic realm? Are there any other methods?

Descartes-rooted culture makes many people believe in a kind of mind-body duality. This group concludes that the human spark, or consciousness, can't have a scientific explanation (from here on I am putting spark, insight, Artificial General Intelligence, and consciousness into one bag) and thus cannot be reproduced in full by a machine.

Some, like Roger Penrose in his “Shadows of the Mind”, state that it can be explained, but that it is non-computational and we still have insufficient knowledge of the world to do so.

Others, called computationalists, hold that Gödel's incompleteness theorems or Turing's halting problem do not prevent us from constructing a complete human-like mechanistic mind, an AGI. This view is shared by influential academics in the field like Scott Aaronson, educated influencers like Sean Carroll, and highly regarded thinkers like Daniel Dennett. The idea is that the current understanding of the world, our knowledge of physics, emergence, computation, and so on, is enough to explain the human mind; we “just” need to work through many thick layers of complexity.

On the more extreme end of the spectrum are tech enthusiasts who bet on AGI emerging just around the corner, much like it was advertised for ChatGPT in 2025. For now, this seems likely to happen only by watering down the definition of AGI itself to fit into corporate fiscal-year spreadsheets.

By the way, the story around (anti-)computationalist arguments is one we enjoy discussing at the Complexity Explorers Krakow group, to which I warmly invite you at cekrk.com.

Limits of AI. The Gödelian argument. Complexity Explorers Krakow #2 | PDF

That said, strong investment in AI has resulted in new techniques being built into software. The ongoing progress strips away, one by one, competences once regarded as exclusively human. Consider:

  • AlphaGo beating the world champion in the board game Go (2016), showing strategic “intuition” rather than relying on brute-force computation.
  • The discovery of the drug Halicin (named after HAL 9000) in 2020 by MIT researchers, where an AI model screened 100 million chemical compounds and found a molecule that physically looked nothing like a traditional antibiotic but was lethal to drug-resistant bacteria, showing AI capable of scientific serendipity and conceptual discovery.
  • The digital artwork “Théâtre D'opéra Spatial” by Jason M. Allen receiving an award in digital arts from human judges and sparking discussion (2022).
  • In 2024, AlphaGeometry matching a human gold medalist by solving 25 of 30 Olympiad-level geometry problems, demonstrating AI's ability in symbolic deductive mathematical reasoning.
  • The beginning of 2026 marked by progress in AI-for-math research, such as solving Erdős problems and beyond.

Théâtre D'opéra Spatial, Jason M. Allen 2022; Midjourney, Public domain, via Wikimedia Commons

Standing on the shoulders of AI models

RealSpark is a self-contained web application designed to authenticate and analyze traditional artwork (such as oil paintings, watercolors, and acrylics) using a suite of compact AI models. The prototype distinguishes between human-created art and AI-generated content. It is also an example of running a workflow of cooperating AI models to drive the planned analysis to a conclusion. Some properties gathered along the way during analysis serve as candidates for feature engineering.

All image-processing AI models in the application are built on the ViT (Vision Transformer) architecture, which was borrowed from NLP and became an alternative to Convolutional Neural Networks (CNNs) in image recognition. Vision transformers have shown better scalability in learning and an ability to spot long-range dependencies in images, thanks to self-attention mechanisms (though this comes with higher computational demands for higher-resolution images). To execute inference, the application relies on the Hugging Face transformers library.
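As a minimal sketch of what inference with the transformers library looks like in practice, here a generic public ViT checkpoint (google/vit-base-patch16-224, used purely for illustration, not one of the RealSpark models described below) turns an image into one embedding per patch:

```python
# Minimal ViT inference sketch with Hugging Face transformers.
# The checkpoint is a generic public ViT, used here only for illustration.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = AutoModel.from_pretrained("google/vit-base-patch16-224")

image = Image.open("painting.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")  # resize + split into 16x16 patches

with torch.no_grad():
    outputs = model(**inputs)

# One 768-dim embedding per patch plus the [CLS] token: (1, 197, 768) for 224x224 input
print(outputs.last_hidden_state.shape)
```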

Classifier for AI Art Authenticator

The problem of detecting whether an image is AI-generated is considered extremely hard. Studies show human accuracy is often around 60-63%. There is a kind of arms race between image creators and verifiers. Art itself is also progressing; one might ask, why not start painting humans with a proverbial six-fingered hand, or simulate other AI artifacts in human-made art?

There is a high risk of false positives: real human art, especially digital paintings or highly edited photos, would frequently be tagged as AI-made.

RealSpark uses the compact Ateeqq/ai-vs-human-image-detector model and returns the probability of the image being AI-generated. The classifier processes images by chopping them into fixed-size square patches and analyzing the relationships between them.
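A minimal sketch of this step, assuming the model is exposed through the standard image-classification pipeline (the exact label names returned by the checkpoint are an assumption here):

```python
# Sketch of the AI-vs-human check via the standard image-classification pipeline.
from PIL import Image
from transformers import pipeline

detector = pipeline("image-classification", model="Ateeqq/ai-vs-human-image-detector")

image = Image.open("painting.jpg").convert("RGB")
results = detector(image)  # list of {"label": ..., "score": ...} entries

# Pick the score of the AI-looking label; the label naming is an assumption.
ai_probability = next((r["score"] for r in results if "ai" in r["label"].lower()), None)
print(results, ai_probability)
```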

While computationally demanding, this is still only the initial step of image authentication. At the next level, classification could be improved by analyzing (e.g., with CNNs) the high-frequency noise characteristic of AI-generated content (which works for images not compressed by JPEG).
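That noise-analysis step is not part of RealSpark; as a purely hypothetical sketch, the high-frequency residual such a CNN could consume can be obtained by subtracting a blurred copy of the image from the original:

```python
# Hypothetical pre-processing for a future noise-based detector (not in RealSpark):
# the high-frequency residual is what remains after removing low-frequency content.
import numpy as np
from PIL import Image, ImageFilter

image = Image.open("painting.jpg").convert("L")             # grayscale
blurred = image.filter(ImageFilter.GaussianBlur(radius=2))  # low-frequency content

residual = np.asarray(image, dtype=np.float32) - np.asarray(blurred, dtype=np.float32)
print(residual.std())  # crude summary statistic of high-frequency noise
```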

Clustering physical texture

For analysing physical texture, RealSpark hands off to another local ViT model, facebook/dinov2-base, trained using the DINOv2 method. It excels at fine-grained visual discrimination, e.g. of textures or shapes, and is a natural choice for capturing details like brushstrokes, canvas grain, and pixel-level consistency.

DINOv2 is a self-supervised learning technique, meaning the model learns by observing patterns, without being given labels. Its strength lies more in how the image is constructed than in what the image means.
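A minimal sketch of how the texture step can look with DINOv2 patch embeddings; summarizing them by their variance as a rough proxy for texture variability is an assumption, not necessarily the exact statistic RealSpark reports:

```python
# Sketch: DINOv2 patch embeddings as a texture signal.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = AutoModel.from_pretrained("facebook/dinov2-base")

image = Image.open("painting.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

patch_embeddings = outputs.last_hidden_state[:, 1:, :]  # drop the [CLS] token
texture_variability = patch_embeddings.var(dim=1).mean().item()
print(f"Texture variability: {texture_variability:.4f}")
```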

Art medium labeling

For determining the artistic medium, RealSpark uses CLIP (Contrastive Language-Image Pre-Training), which is optimized for matching images to text. It "understands" the relationship between visual concepts and text. The app runs inference with the local openai/clip-vit-base-patch32 model.

CLIP aligns text and images in a shared embedding space, meaning it semantically joins the numerical representations of image patches and text. As a zero-shot classifier it requires no training data, but the application provides a list of text labels at runtime for inference, e.g. "Oil painting", "Watercolor", "Digital art", "Acrylic", etc. The model uses these labels to create a search space. It is contrastive because the model compares the labels with each other to find the most probable one for a given image.
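A minimal sketch of the medium-labeling step through the zero-shot image-classification pipeline; the label list below is an example, not necessarily the one RealSpark ships with:

```python
# Sketch: zero-shot medium labeling with CLIP.
from PIL import Image
from transformers import pipeline

clip = pipeline("zero-shot-image-classification", model="openai/clip-vit-base-patch32")

labels = ["Oil painting", "Watercolor", "Acrylic", "Digital art", "Pencil drawing"]
image = Image.open("painting.jpg").convert("RGB")

results = clip(image, candidate_labels=labels)  # labels ranked by probability
best = results[0]
print(f"Likely {best['label']} (Confidence: {best['score']:.2f})")
```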

Both DINOv2 texture findings and CLIP labels are merged into one generated description. For example: "Likely Oil (Confidence: 0.92). Texture is highly varied, suggesting complex physical brushwork or impasto."

Object Detection

For object detection in the workflow, the application uses YOLOS (You Only Look at One Sequence), hustvl/yolos-tiny. YOLOS looks at the entire image in one single pass and was chosen because of its low resource footprint.
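A minimal sketch of this step through the object-detection pipeline; the confidence threshold is an assumption:

```python
# Sketch: object detection with the tiny YOLOS checkpoint.
from PIL import Image
from transformers import pipeline

detector = pipeline("object-detection", model="hustvl/yolos-tiny")

image = Image.open("painting.jpg").convert("RGB")
for obj in detector(image, threshold=0.7):
    print(obj["label"], round(obj["score"], 2), obj["box"])
```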

Synthesizing results with local LLM

The final consolidation step of the workflow waits until all image-processing analyses are finished and then creates a textual description with the instruction-tuned LLM google/flan-t5-small.

The model tends to suffer from issues typical of a small LLM. In early iterations, there were several issues with the outputs generated by the language model:

  • repetitive output (echoing input)
  • prompt leakage (revealing instruction and parameters)
  • infinite looping (stuck on phrases)
  • hitting context length limits of 512 tokens

To resolve these, the first updates switched from single-shot to few-shot prompting (providing 2-3 input/output examples to implicitly teach the desired tone and structure) and experimented with the repetition_penalty and no_repeat_ngram_size parameters.

The output still looked stiff, as if one template were used for all results, so I moved from greedy search to nucleus sampling (the top-p parameter) and adjusted temperature, as both increase randomness. The difference between them is that increasing temperature raises creativity: a broader range of (less expected) words are included among the possible next-token choices. Nucleus sampling, in turn, defines a probability threshold that limits the set of tokens to pick from. Tuning both together is somewhat tricky, but you can expect more synonyms (higher temperature) without too many fantastical, out-of-scope words (top-p < 1).
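A minimal sketch of that sampling setup; the concrete parameter values are illustrative, not the ones RealSpark settled on:

```python
# Sketch: sampling-based generation with flan-t5-small.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

prompt = "Summarize the analysis: AI probability 0.12, medium Oil painting, highly varied texture."
inputs = tokenizer(prompt, return_tensors="pt")

output_ids = model.generate(
    **inputs,
    max_new_tokens=120,
    do_sample=True,          # nucleus sampling instead of greedy search
    top_p=0.9,               # keep tokens within 90% cumulative probability
    temperature=0.8,         # broaden the next-token distribution slightly
    repetition_penalty=1.3,  # discourage echoing the input
    no_repeat_ngram_size=3,  # block loops on repeated phrases
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```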

Debugging ended with budgeting tokens for the prompt (which includes the single-shot instruction), leaving a 25% token buffer to prevent hitting the limit, and adding some output post-processing to strip unwanted parameter names and repetitions.
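A minimal sketch of the token-budgeting idea, assuming the prompt is kept within roughly 75% of the 512-token context so generation has room to finish:

```python
# Sketch: truncate the prompt to leave a 25% buffer of the 512-token context.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")

CONTEXT_LIMIT = 512
PROMPT_BUDGET = int(CONTEXT_LIMIT * 0.75)  # reserve 25% of the window for output

def fit_prompt(prompt: str) -> str:
    ids = tokenizer(prompt, truncation=True, max_length=PROMPT_BUDGET)["input_ids"]
    return tokenizer.decode(ids, skip_special_tokens=True)

print(fit_prompt("Summarize the analysis: " + "very long text " * 200))
```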

Agents generate the code, my job is to prove every iteration works

Implementation was driven by LLM code-assist agents, which put the requirement to employ Test Driven Development on a new level. Each coding iteration is accompanied by three kinds of tests:

  • unit tests, using Pytest for backend logic (including mocked AI pipelines; see the sketch below);
  • Playwright tests for end-to-end browser integration;
  • additional pure front-end tests, supported by Node.js libraries, ensuring that use-case-defined views stay consistent after changes.
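A hypothetical unit-test sketch of the mocked-pipeline idea; the module and function names (realspark.analysis, classify_authenticity, detector) are illustrative, not the actual RealSpark code:

```python
# Sketch: a Pytest unit test that stubs out the heavyweight AI pipeline.
from unittest.mock import patch

import realspark.analysis as analysis  # hypothetical module name


def test_classifier_reports_ai_probability():
    fake_result = [{"label": "ai", "score": 0.91}, {"label": "human", "score": 0.09}]
    # Replace the Hugging Face pipeline with a stub so no model is loaded.
    with patch.object(analysis, "detector", lambda image: fake_result):
        assert analysis.classify_authenticity("tests/fixtures/sample.jpg") == 0.91
```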

Agents generate most of the code; my job is to review it and prove it works. Each iteration needs its regressions kept under control and a path for the next change engineered.

The backend is built with standard FastAPI, using the Uvicorn web server for async performance and simple DuckDB for storage.
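A minimal sketch of that stack, with a hypothetical table and endpoint (run with uvicorn app:app):

```python
# Sketch: FastAPI + DuckDB skeleton (table and endpoint names are illustrative).
import duckdb
from fastapi import FastAPI, UploadFile

app = FastAPI()
db = duckdb.connect("realspark.duckdb")
db.execute("CREATE TABLE IF NOT EXISTS uploads (name TEXT, size INTEGER)")

@app.post("/upload")
async def upload(file: UploadFile):
    data = await file.read()
    db.execute("INSERT INTO uploads VALUES (?, ?)", [file.filename, len(data)])
    return {"filename": file.filename, "bytes": len(data)}
```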

The API model is described using the OpenAPI standard. In early iterations, the specification was also used to generate a JavaScript SDK (for API-to-front-end communication). That was ditched after the inclusion of Alpine.js, a declarative framework that allows writing logic directly in HTML (similar to Vue or React) but without a build step. For a relatively simple front-end that doesn't need complex state management, it allows you to skip generating boilerplate code altogether.

The user uploads an image, which is immediately processed by jobs running simultaneously (with a 2-minute timeout and some locks) and joined into a summary. This design is harder to debug and test, but it makes the user experience natural, as results are displayed live on the screen as soon as they are ready.
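A minimal sketch of the concurrent fan-out with an overall 2-minute budget (asyncio.timeout requires Python 3.11+); the analysis functions are stand-ins, not the actual RealSpark code:

```python
# Sketch: run the blocking analysis steps concurrently and join the results.
import asyncio
import time

# Stand-ins for the real analysis functions (each a blocking model call).
def classify_authenticity(path):
    time.sleep(1)
    return {"ai_probability": 0.12}

def analyze_texture(path):
    time.sleep(1)
    return {"variability": "high"}

def label_medium(path):
    time.sleep(1)
    return {"label": "Oil painting"}

def detect_objects(path):
    time.sleep(1)
    return []

async def run_analyses(image_path: str) -> dict:
    async with asyncio.timeout(120):  # overall 2-minute budget
        results = await asyncio.gather(
            asyncio.to_thread(classify_authenticity, image_path),
            asyncio.to_thread(analyze_texture, image_path),
            asyncio.to_thread(label_medium, image_path),
            asyncio.to_thread(detect_objects, image_path),
        )
    return dict(zip(["authenticity", "texture", "medium", "objects"], results))

print(asyncio.run(run_analyses("painting.jpg")))
```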

At this point, you may feel capable of approaching almost any problem, such as analyzing human-level competence: starting from intuition or artistic practice, moving through thought experiments, and then backing it up with a concrete implementation that uses community-built AI models as experiments. AI-backed development has now reached a level that makes such a solo project commonly feasible, opening up a view of the longer distance ahead.

