Images are classified only by URL? #3572

foolip · 2024-02-12T02:06:05Z

I have been poking at https://github.com/HTTPArchive/almanac.httparchive.org/blob/main/sql/2022/media/bytes_and_dimensions_by_format.sql to get an updated view of quality distributions in the wild.

I happened to look for 'heif' images and was surprised how many I found. It turns out that for example https://gaijincph.dk/ serves https://ftstorageprod.blob.core.windows.net/images/restaurant/5ad12db8/images/1e59c431-1198-4859-bebd-769d37d1a975_m.heic

That gets classified by pithyType() as 'heif' because the URL ends with '.heic'. However, it's actually a JPEG.

In fact, it seems that only the URL is used, because of this call:

almanac.httparchive.org/sql/2022/media/bytes_and_dimensions_by_format.sql

Line 112 in ff9fd22

resourceFormat: pithyType({ contentType: d.mimeType, url: d.url })

There is no mimeType in the data, at least not in the httparchive.pages.2024_01_01_desktop data. Here's what I unpacked from payload and a few nested JSON objects for https://gaijincph.dk/:

[
  {
    "hasSrc": true,
    "hasAlt": true,
    "isInPicture": false,
    "hasCustomDataAttributes": false,
    "hasWidth": false,
    "hasHeight": false,
    "url": "https://ftstorageprod.blob.core.windows.net/images/restaurant/5ad12db8/logo.png?v58288",
    "totalCandidates": 1,
    "altAttribute": "GAIJIN logo",
    "clientWidth": 150,
    "clientHeight": 134,
    "naturalWidth": 2097,
    "naturalHeight": 1598,
    "hasSrcset": false,
    "hasSizes": false,
    "currentSrcDensity": 1,
    "approximateResourceWidth": 2097,
    "approximateResourceHeight": 1598,
    "byteSize": 125672,
    "bitsPerPixel": 0.3000221426043403,
    "computedSizingStyles": {
      "width": "auto",
      "height": "auto",
      "maxWidth": "150px",
      "maxHeight": "none",
      "minWidth": "auto",
      "minHeight": "auto"
    },
    "intrinsicOrExtrinsicSizing": {
      "width": "both",
      "height": "intrinsic"
    },
    "reservedLayoutDimensions": false
  },
  {
    "hasSrc": true,
    "hasAlt": true,
    "isInPicture": false,
    "hasCustomDataAttributes": false,
    "hasWidth": true,
    "hasHeight": true,
    "url": "https://ftstorageprod.blob.core.windows.net/images/restaurant/5ad12db8/images/e6680b7e-a494-49d3-b8dd-027338d28566_m.jpg",
    "totalCandidates": 1,
    "heightAttribute": "100%",
    "widthAttribute": "100%",
    "altAttribute": "Picture of The Full Gaijin Experience",
    "clientWidth": 411,
    "clientHeight": 411,
    "naturalWidth": 720,
    "naturalHeight": 826,
    "hasSrcset": false,
    "hasSizes": false,
    "currentSrcDensity": 1,
    "approximateResourceWidth": 720,
    "approximateResourceHeight": 826,
    "byteSize": 283244,
    "bitsPerPixel": 3.8101156846919557,
    "computedSizingStyles": {
      "width": "100%",
      "height": "100%",
      "maxWidth": "none",
      "maxHeight": "none",
      "minWidth": "100%",
      "minHeight": "100%"
    },
    "intrinsicOrExtrinsicSizing": {
      "width": "extrinsic",
      "height": "extrinsic"
    },
    "reservedLayoutDimensions": false
  },
  {
    "hasSrc": true,
    "hasAlt": true,
    "isInPicture": false,
    "hasCustomDataAttributes": false,
    "hasWidth": true,
    "hasHeight": true,
    "url": "https://ftstorageprod.blob.core.windows.net/images/restaurant/5ad12db8/images/1e59c431-1198-4859-bebd-769d37d1a975_m.heic",
    "totalCandidates": 1,
    "heightAttribute": "100%",
    "widthAttribute": "100%",
    "altAttribute": "Picture of Tasting menu",
    "clientWidth": 411,
    "clientHeight": 411,
    "naturalWidth": 720,
    "naturalHeight": 960,
    "hasSrcset": false,
    "hasSizes": false,
    "currentSrcDensity": 1,
    "approximateResourceWidth": 720,
    "approximateResourceHeight": 960,
    "byteSize": 214586,
    "bitsPerPixel": 2.4836342592592593,
    "computedSizingStyles": {
      "width": "100%",
      "height": "100%",
      "maxWidth": "none",
      "maxHeight": "none",
      "minWidth": "100%",
      "minHeight": "100%"
    },
    "intrinsicOrExtrinsicSizing": {
      "width": "extrinsic",
      "height": "extrinsic"
    },
    "reservedLayoutDimensions": false
  },
  {
    "hasSrc": true,
    "hasAlt": true,
    "isInPicture": false,
    "hasCustomDataAttributes": false,
    "hasWidth": true,
    "hasHeight": true,
    "url": "https://ftstorageprod.blob.core.windows.net/images/restaurant/5ad12db8/images/73c94994-f80a-4f1c-b524-44d3e80e28ee_m.heic",
    "totalCandidates": 1,
    "heightAttribute": "100%",
    "widthAttribute": "100%",
    "altAttribute": "Picture of A la carte",
    "clientWidth": 411,
    "clientHeight": 411,
    "naturalWidth": 720,
    "naturalHeight": 900,
    "hasSrcset": false,
    "hasSizes": false,
    "currentSrcDensity": 1,
    "approximateResourceWidth": 720,
    "approximateResourceHeight": 900,
    "byteSize": 248070,
    "bitsPerPixel": 3.0625925925925928,
    "computedSizingStyles": {
      "width": "100%",
      "height": "100%",
      "maxWidth": "none",
      "maxHeight": "none",
      "minWidth": "100%",
      "minHeight": "100%"
    },
    "intrinsicOrExtrinsicSizing": {
      "width": "extrinsic",
      "height": "extrinsic"
    },
    "reservedLayoutDimensions": false
  },
  {
    "hasSrc": true,
    "hasAlt": true,
    "isInPicture": false,
    "hasCustomDataAttributes": false,
    "hasWidth": true,
    "hasHeight": true,
    "url": "https://ftstorageprod.blob.core.windows.net/images/restaurant/5ad12db8/images/29fd156a-2b46-4cb6-a5a3-0e481f66aaba_m.png",
    "totalCandidates": 1,
    "heightAttribute": "100%",
    "widthAttribute": "100%",
    "altAttribute": "Picture of Private Dining",
    "clientWidth": 411,
    "clientHeight": 411,
    "naturalWidth": 709,
    "naturalHeight": 540,
    "hasSrcset": false,
    "hasSizes": false,
    "currentSrcDensity": 1,
    "approximateResourceWidth": 709,
    "approximateResourceHeight": 540,
    "byteSize": 22605,
    "bitsPerPixel": 0.47233975865851746,
    "computedSizingStyles": {
      "width": "100%",
      "height": "100%",
      "maxWidth": "none",
      "maxHeight": "none",
      "minWidth": "100%",
      "minHeight": "100%"
    },
    "intrinsicOrExtrinsicSizing": {
      "width": "extrinsic",
      "height": "extrinsic"
    },
    "reservedLayoutDimensions": false
  },
  {
    "hasSrc": true,
    "hasAlt": true,
    "isInPicture": false,
    "hasCustomDataAttributes": false,
    "hasWidth": true,
    "hasHeight": false,
    "url": "https://ftstorageprod.blob.core.windows.net/images/restaurant/5ad12db8/coverimage_l.jpeg",
    "totalCandidates": 1,
    "widthAttribute": "100%",
    "altAttribute": "",
    "clientWidth": 918,
    "clientHeight": 918,
    "naturalWidth": 1200,
    "naturalHeight": 1200,
    "hasSrcset": false,
    "hasSizes": false,
    "currentSrcDensity": 1,
    "approximateResourceWidth": 1200,
    "approximateResourceHeight": 1200,
    "byteSize": 61206,
    "bitsPerPixel": 0.34003333333333335,
    "computedSizingStyles": {
      "width": "100%",
      "height": "auto",
      "maxWidth": "none",
      "maxHeight": "none",
      "minWidth": "auto",
      "minHeight": "auto"
    },
    "intrinsicOrExtrinsicSizing": {
      "width": "extrinsic",
      "height": "intrinsic"
    },
    "reservedLayoutDimensions": false
  }
]

Since the number of bytes and the decoded width and height are known, the decoder that was actually used should in principle be knowable.
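
As a sanity check on the dump above, the bitsPerPixel values appear consistent with byteSize * 8 / (naturalWidth * naturalHeight). A minimal sketch, assuming that formula (it is inferred from the numbers, not taken from the crawler's source); the helper name is hypothetical:

```javascript
// Hypothetical helper: bits per decoded pixel.
// The formula is an assumption inferred from the values in the dump above.
function bitsPerPixel(byteSize, naturalWidth, naturalHeight) {
  const pixels = naturalWidth * naturalHeight;
  if (pixels === 0) return null; // nothing decoded, so no meaningful ratio
  return (byteSize * 8) / pixels;
}

// The "..._m.heic" entry from the dump: 214586 bytes at 720x960.
const heicNamedBpp = bitsPerPixel(214586, 720, 960);
// The logo.png entry from the dump: 125672 bytes at 2097x1598.
const logoBpp = bitsPerPixel(125672, 2097, 1598);
console.log(heicNamedBpp, logoBpp);
```

A ratio like this says how densely an image is encoded, but on its own it does not identify the decoder, which is why a sniffed type would still be needed.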


rviscomi · 2024-02-12T21:41:13Z

cc @eeeps

eeeps · 2024-02-13T16:57:24Z

Those URLs are returned with the following HTTP header:

Content-Type: application/octet-stream

That fails the Regex test here, so we fall back to looking at the file extension, at the place you identified.
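
To make the fallback concrete, here is a minimal sketch of the behavior described above. It is not the actual pithyType() source: the function name, the regex, and the alias table are assumptions chosen for illustration.

```javascript
// Hypothetical stand-in for pithyType(); the real regex and format names may differ.
const FORMAT_ALIASES = new Map([["jpeg", "jpg"], ["heic", "heif"]]);
const normalizeFormat = (name) =>
  FORMAT_ALIASES.has(name) ? FORMAT_ALIASES.get(name) : name;

function classifyFormat({ contentType, url }) {
  // 1. Prefer the Content-Type header when it names an image type.
  const typeMatch = /^image\/([a-z0-9.+-]+)/i.exec(contentType || "");
  if (typeMatch) return normalizeFormat(typeMatch[1].toLowerCase());

  // 2. Otherwise fall back to the file extension of the URL path.
  let pathname;
  try {
    pathname = new URL(url).pathname; // drops any query string such as "?v58288"
  } catch {
    return "unknown"; // not a parseable absolute URL
  }
  const extMatch = /\.([a-z0-9]+)$/i.exec(pathname);
  return extMatch ? normalizeFormat(extMatch[1].toLowerCase()) : "unknown";
}

// application/octet-stream is not an image type, so the extension decides.
console.log(classifyFormat({
  contentType: "application/octet-stream",
  url: "https://ftstorageprod.blob.core.windows.net/images/restaurant/5ad12db8/images/1e59c431-1198-4859-bebd-769d37d1a975_m.heic",
}));
```

With a missing or generic Content-Type, the result depends entirely on what the URL happens to end with, which is how a JPEG can end up counted as 'heif'.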

I agree that the crawler knows more than we can learn from URLs and HTTP headers alone, and it would be nice to have the actual decoded type exposed to catch cases like this (or, failing that, at least to get a sense of how common such cases are). It might already be exposed, thanks to the work Pat Meenan did in 2022 to run the actual image resources through ImageMagick and report a bunch of details (see the note in the README: https://github.com/HTTPArchive/almanac.httparchive.org/tree/ff9fd22f0489469ebf3254de6072f63cf086407a/sql/2022/media#notes-for-2023). I'll try to dig in later today to see why we didn't use that here.

eeeps · 2024-02-13T17:00:59Z

That was fast! We didn't get to use any of the ImageMagick data here because this query is working from <img>s found in the markup, rather than from HTTP requests. See my note in the readme about my failure to join requests up to loaded <img> resources, and how that was my number one TODO going forward.

foolip · 2024-02-13T17:11:28Z

@eeeps is any of the code using ImageMagick running in the current crawls? I've been thinking about exactly that these past few days: whether we could run identify -format "%Q\n" on JPEG files in particular, to understand the quality in a different way. I assumed that none of the resources are on disk, so this would be a big lift, but it sounds like some of the work has already been done?

foolip · 2024-02-13T17:16:29Z

Is the $._image_details data being written to anything in BigQuery yet? If not, is there a sample of that from the raw crawl data that I could look at? I'm interested to know what kind of stuff is in there and if it would help.

rviscomi · 2024-02-13T17:21:31Z

Yeah here's a way to cheaply (355.91 MB) query a sample of the $._image_details object:

SELECT
  url,
  JSON_QUERY(payload, '$._image_details') AS image_details
FROM
  `httparchive.all.requests` TABLESAMPLE SYSTEM (0.001 PERCENT)
WHERE
  date = '2024-01-01' AND
  client = 'mobile' AND
  is_root_page AND
  type = 'image'
LIMIT
  10

Sample result

{
    "detected_type": "jpeg",
    "metadata": {
        "ExifTool": {
            "ExifToolVersion": 12.52
        },
        "File": {
            "FileSize": "137 kB",
            "FileType": "JPEG",
            "FileTypeExtension": "jpg",
            "MIMEType": "image/jpeg",
            "Comment": "CREATOR: gd-jpeg v1.0 (using IJG JPEG v80), quality = 80\n",
            "ImageWidth": 800,
            "ImageHeight": 800,
            "EncodingProcess": "Baseline DCT, Huffman coding",
            "BitsPerSample": 8,
            "ColorComponents": 3,
            "YCbCrSubSampling": "YCbCr4:2:0 (2 2)"
        },
        "JFIF": {
            "JFIFVersion": 1.01,
            "ResolutionUnit": "inches",
            "XResolution": 96,
            "YResolution": 96
        },
        "Composite": {
            "ImageSize": "800x800",
            "Megapixels": 0.64
        }
    },
    "magick": {
        "baseName": "10710.94",
        "format": "JPEG",
        "formatDescription": "JPEG",
        "mimeType": "image/jpeg",
        "class": "DirectClass",
        "geometry": {
            "width": 800,
            "height": 800,
            "x": 0,
            "y": 0
        },
        "resolution": {
            "x": 96,
            "y": 96
        },
        "printSize": {
            "x": 8.33333,
            "y": 8.33333
        },
        "units": "PixelsPerInch",
        "type": "TrueColor",
        "baseType": "Undefined",
        "endianness": "Undefined",
        "colorspace": "sRGB",
        "depth": 8,
        "baseDepth": 8,
        "channelDepth": {
            "red": 8,
            "green": 8,
            "blue": 1
        },
        "pixels": 1920000,
        "imageStatistics": {
            "Overall": {
                "min": 0,
                "max": 255,
                "mean": 65.7495,
                "median": 35.6667,
                "standardDeviation": 79.6716,
                "kurtosis": 0.0952899,
                "skewness": 1.18423,
                "entropy": 0.835339
            }
        },
        "channelStatistics": {
            "red": {
                "min": 0,
                "max": 255,
                "mean": 58.5843,
                "median": 15,
                "standardDeviation": 82.5814,
                "kurtosis": 0.438709,
                "skewness": 1.4063,
                "entropy": 0.805027
            },
            "green": {
                "min": 0,
                "max": 255,
                "mean": 60.4429,
                "median": 30,
                "standardDeviation": 76.1642,
                "kurtosis": 0.756071,
                "skewness": 1.40876,
                "entropy": 0.838816
            },
            "blue": {
                "min": 0,
                "max": 255,
                "mean": 78.2214,
                "median": 62,
                "standardDeviation": 80.2692,
                "kurtosis": -0.494016,
                "skewness": 0.80254,
                "entropy": 0.862173
            }
        },
        "renderingIntent": "Perceptual",
        "gamma": 0.454545,
        "chromaticity": {
            "redPrimary": {
                "x": 0.64,
                "y": 0.33
            },
            "greenPrimary": {
                "x": 0.3,
                "y": 0.6
            },
            "bluePrimary": {
                "x": 0.15,
                "y": 0.06
            },
            "whitePrimary": {
                "x": 0.3127,
                "y": 0.329
            }
        },
        "matteColor": "#BDBDBD",
        "backgroundColor": "#FFFFFF",
        "borderColor": "#DFDFDF",
        "transparentColor": "#00000000",
        "interlace": "None",
        "intensity": "Undefined",
        "compose": "Over",
        "pageGeometry": {
            "width": 800,
            "height": 800,
            "x": 0,
            "y": 0
        },
        "dispose": "Undefined",
        "iterations": 0,
        "compression": "JPEG",
        "quality": 80,
        "orientation": "Undefined",
        "properties": {
            "comment": "CREATOR: gd-jpeg v1.0 (using IJG JPEG v80), quality = 80\n",
            "date:create": "2024-01-13T04:33:02+00:00",
            "date:modify": "2024-01-13T04:33:02+00:00",
            "date:timestamp": "2024-01-13T04:34:18+00:00",
            "jpeg:colorspace": "2",
            "jpeg:sampling-factor": "2x2,1x1,1x1",
            "signature": "0d0e8995e2aae98c15e1e2bc69c8f988423e022cf4055d72e9752a457a53a440"
        },
        "tainted": false,
        "filesize": "136972B",
        "numberPixels": "640000",
        "pixelsPerSecond": "40.1272MB",
        "userTime": "0.020u",
        "elapsedTime": "0:01.015"
    }
}
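
Given an object shaped like the sample above, a comparison between the URL's extension and the sniffed type could look like the following. This is a sketch only: the field names detected_type and magick.quality come from the sample, while the function name, the jpg/jpeg equivalence, and the example.com URL are made up for illustration. It also assumes the image_details column arrives as a JSON string.

```javascript
// Hypothetical helper; only detected_type and magick.quality are taken from the sample above.
function summarizeImage(url, imageDetailsJson) {
  const details = imageDetailsJson ? JSON.parse(imageDetailsJson) : null;

  const extMatch = /\.([a-z0-9]+)$/i.exec(new URL(url).pathname);
  const urlExtension = extMatch ? extMatch[1].toLowerCase() : null;

  const detectedType =
    details && typeof details.detected_type === "string" ? details.detected_type : null;
  const quality =
    details && details.magick && typeof details.magick.quality === "number"
      ? details.magick.quality
      : null;

  // Treat "jpg" and "jpeg" as the same format when comparing.
  const canon = (t) => (t === "jpg" ? "jpeg" : t);
  const mismatch =
    urlExtension !== null &&
    detectedType !== null &&
    canon(urlExtension) !== canon(detectedType);

  return { urlExtension, detectedType, quality, mismatch };
}

// Made-up row: a ".heic" URL whose body was detected as JPEG.
console.log(
  summarizeImage(
    "https://example.com/photo.heic",
    JSON.stringify({ detected_type: "jpeg", magick: { quality: 80 } })
  )
);
```

Rows where _image_details is missing come back with detectedType set to null and mismatch set to false, so they would need to be counted separately rather than treated as matches.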

eeeps · 2024-02-13T17:45:25Z

As per usual, Rick beat me to it. Different (older?) flavor:

SELECT
  url,
  JSON_QUERY(payload, '$._image_details') as image_details
FROM `httparchive.requests.2023_12_01_mobile` TABLESAMPLE SYSTEM (0.001 PERCENT)
WHERE JSON_QUERY(payload, '$._image_details') IS NOT NULL


foolip · 2024-02-14T01:39:43Z

Thank you @rviscomi and @eeeps, my joy is boundless! I will play around with this.

foolip · 2024-02-14T11:09:17Z

After some terrible queries and intermediate tables I have a first result:

Is this the right repo for questions like "why is _image_details sometimes missing?" and the other things I'll need to figure out to refine this?

rviscomi · 2024-02-14T16:35:01Z

Yeah I think here is fine

cc @pmeenan

eeeps · 2024-02-14T16:39:54Z

@foolip not sure about venue (if I have a discussion that might require a chattier exploration, I generally start it in the HTTP Archive Slack), but @pmeenan is the person to ask about missing _image_details.

Interesting chart! I do worry though... the "quality" reported by ImageMagick's identify for JPEGs, like most 0-100 quality scales used by encoders, is arbitrary and IM- and JPEG-specific. It's based on the quantization tables IM finds in the file, which will mostly correlate with what people think "quality" means (a subjective evaluation of "how good" the output looks when compared with the input), but not at all exactly. Worse, this value doesn't line up with other formats or other tools. People expect "quality 80" to mean the same thing everywhere. It does not, even for tools that are only dealing with JPEGs, and once you're talking other formats, you're in another universe.

That said... the number of quality: 100 JPEGs here -- wow. Antipattern!

pmeenan · 2024-02-14T16:51:18Z

If you have examples for where it is missing I can take a look. It could happen if for some reason the image response body isn't available or the code that detects the image type by looking at the header bytes doesn't recognize it.

heif is definitely not detected but the others should be reasonably up to date.
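
For readers unfamiliar with header-byte detection, a minimal sketch follows. It is not the crawler's code: the function name is hypothetical, and only four well-known file signatures are checked (JPEG, PNG, GIF, WebP).

```javascript
// Hypothetical sniffing helper; checks only a few well-known file signatures.
// `bytes` is a Uint8Array (or plain array) holding the start of the response body.
function sniffImageType(bytes) {
  const startsWith = (sig, offset = 0) =>
    bytes.length >= offset + sig.length &&
    sig.every((b, i) => bytes[offset + i] === b);
  const ascii = (s) => Array.from(s, (c) => c.charCodeAt(0));

  if (startsWith([0xff, 0xd8, 0xff])) return "jpeg";
  if (startsWith([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) return "png";
  if (startsWith(ascii("GIF87a")) || startsWith(ascii("GIF89a"))) return "gif";
  // WebP: "RIFF", four bytes of chunk size, then "WEBP".
  if (startsWith(ascii("RIFF")) && startsWith(ascii("WEBP"), 8)) return "webp";
  return null; // unrecognized: any format missing from the list above
}

// A body starting with FF D8 FF is reported as JPEG regardless of its URL.
console.log(sniffImageType(Uint8Array.from([0xff, 0xd8, 0xff, 0xe0])));
```

Because the URL is never consulted, a JPEG served from a URL ending in .heic is still reported as jpeg here, and any format absent from the signature list simply comes back as null.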

foolip mentioned this issue on Feb 13, 2024: Missing image codecs whatwg/mimesniff#143

foolip mentioned this issue on Mar 3, 2024: Media 2024 #3596
