Oh my God. This really freaked me out… like what the hell.

      • j4k3@lemmy.world
        14 days ago
        The error comes from how the model loader code works: it is built for general use.

        There is also a high probability that the prompt contains a spelling or grammar error. All publicly available model loaders process the prompt for the whole image. Models are actually trained for object permanence, but implementing this in a way people can understand for general use is too hard a problem for anyone to have solved so far. There is usually some kind of loose association between the description of an individual object and the object itself, based on the order of terms in the prompt, but these associations are not absolute. So if you prompt “man in a red shirt”, you are only saying that the result contains a man, a shirt, and the color red. The man can be naked, the ground can be red, and the shirt can be on a clothesline. There is a way to associate these objects deterministically in the Transformers library, but it is hard to implement for general use because of the cascade and the specificity of the topological ordering: red is a property of tee shirt, tee shirt is a property of man, man is a property of primary image subject, and primary image subject is a property of medium view with depth-of-field focus.
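        The difference between the loose reading and the strict, nested reading can be shown with a toy sketch. This is only an illustration of the idea: the `Node` class and both helper functions are made up for this example and do not reflect how any model loader or the Transformers library represents a prompt.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node: a term plus the terms that belong to it."""
    name: str
    props: list[Node] = field(default_factory=list)


def flat_terms(node: Node) -> set[str]:
    """Loose reading: every term in the tree, with no ownership information."""
    terms = {node.name}
    for prop in node.props:
        terms |= flat_terms(prop)
    return terms


def bound_pairs(node: Node) -> set[tuple[str, str]]:
    """Strict reading: (owner, property) pairs, e.g. ("shirt", "red")."""
    pairs = set()
    for prop in node.props:
        pairs.add((node.name, prop.name))
        pairs |= bound_pairs(prop)
    return pairs


# "man in a red shirt" as a cascade: red -> shirt -> man -> subject
prompt = Node("subject", [Node("man", [Node("shirt", [Node("red")])])])

# A scene with a naked man, red ground, and a shirt on a clothesline
scene = Node("subject", [
    Node("man"),
    Node("ground", [Node("red")]),
    Node("clothesline", [Node("shirt")]),
])

# The loose reading accepts the scene; the strict reading rejects it.
loose_match = flat_terms(prompt) <= flat_terms(scene)
strict_match = bound_pairs(prompt) <= bound_pairs(scene)
print(loose_match, strict_match)
```

        The scene contains a man, a shirt, and the color red, so it passes the loose check, but none of the ownership pairs beyond subject and man line up, so it fails the strict one.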

        Without object permanence defined, we set the CFG (guidance) value low to tell the model to be mostly creative and only loosely follow the prompt. The looseness largely means that random areas of the prompt get focused on at different points in generation. If the prompt has any spelling errors, grammar errors, or poorly defined areas, they might get focused on randomly. The extra focus causes the model to be like "WTF was I thinking" and make a major change that does not make sense at first. My guess is that this is a case of potatoes = face.
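        Assuming the value in question is the classifier-free guidance (CFG) scale, which is what most image generation tools expose under that name, the way it trades creativity against prompt adherence can be sketched as below. The function name and the two-element lists are made up for illustration; real samplers apply the same formula to large noise-prediction tensors at every step.

```python
def cfg_combine(uncond, cond, scale):
    """Classifier-free guidance: start from the prediction made without
    the prompt and move toward the prompted prediction by `scale`."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


# Stand-ins for the model's noise predictions (illustrative numbers only)
uncond = [0.0, 0.0]   # prediction with an empty prompt
cond = [1.0, -2.0]    # prediction with the user's prompt

ignore_prompt = cfg_combine(uncond, cond, 0.0)  # prompt has no effect
low = cfg_combine(uncond, cond, 1.0)            # follows the prompt loosely
high = cfg_combine(uncond, cond, 7.5)           # pushes hard toward the prompt
print(ignore_prompt, low, high)
```

        A low scale leaves the result close to what the model would do on its own, which is the "mostly creative" behavior described above.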

        Video is basically text to image, but with some details inserted into the iterative process near the end. When I generate text-to-image stuff on my custom ComfyUI setup, I see what looks like ultra-low-resolution, video-like clips of action happening. I find this part of image gen very interesting because tweaking it can cause all kinds of interesting effects. If you have access to a negative prompt, try adding (Shadow, chuck), cross, twist, sophist, sadist. And yes, the capitalization is intentional.

        • AliceOPMA
          14 days ago

          This was so interesting ty!