Images are classified only by URL? #3572
Comments
cc @eeeps
Those URLs are returned with the following HTTP header:
That fails the regex test here, so we fall back to looking at the file extension, at the place you identified. I agree that the crawler knows more than we can by looking at URLs and HTTP headers, and it would be nice to have the actual decoded type exposed to catch cases like this (or, failing that, at least to get a sense of how common such cases are). It might actually already be exposed, because of the work Pat Meenan did in 2022 to get the actual image resources run through ImageMagick and a bunch of things reported (see the note in the README: https://github.com/HTTPArchive/almanac.httparchive.org/tree/ff9fd22f0489469ebf3254de6072f63cf086407a/sql/2022/media#notes-for-2023). I'll try to dig in later today to see why we didn't use that here.
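To make the fallback described above concrete, here is a minimal Python sketch of a "MIME type first, file extension second" classifier. It is not the actual `pithyType()` code from the Almanac queries: the function name `classify_image`, the regex, the `EXTENSION_TO_FORMAT` mapping, and the `application/octet-stream` header in the usage note are all assumptions chosen for illustration.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Hypothetical mapping, for illustration only (not taken from the Almanac SQL).
EXTENSION_TO_FORMAT = {
    "jpg": "jpg", "jpeg": "jpg", "png": "png", "gif": "gif",
    "webp": "webp", "avif": "avif", "svg": "svg",
    "heic": "heif", "heif": "heif",
}

# Matches "image/<subtype>" at the start of a Content-Type value.
IMAGE_MIME = re.compile(r"^image/([a-z0-9.+-]+)", re.IGNORECASE)


def classify_image(mime_type: Optional[str], url: str) -> str:
    """Classify by MIME type when it looks like an image type, else by URL extension."""
    match = IMAGE_MIME.match(mime_type or "")
    if match:
        # Drop structured-syntax suffixes such as "+xml" (svg+xml -> svg).
        subtype = match.group(1).lower().split("+")[0]
        return EXTENSION_TO_FORMAT.get(subtype, subtype)

    # Fallback: the MIME type did not pass the regex, so only the URL is left.
    path = urlparse(url).path
    ext = path.rsplit(".", 1)[-1].lower() if "." in path else ""
    return EXTENSION_TO_FORMAT.get(ext, "unknown")
```

With a generic header such as `application/octet-stream` (an assumed example), `classify_image("application/octet-stream", "https://example.com/photo_m.heic")` falls through to the extension and returns `"heif"`, regardless of what the bytes actually contain.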
That was fast! We didn't get to use any of the ImageMagick data here because this query is working from
@eeeps is any of the code using ImageMagick running in the current crawls? I've been thinking about exactly that these past few days, if we could run
Is the
Yeah, here's a way to cheaply (355.91 MB) query a sample of the _image_details data:
SELECT
url,
JSON_QUERY(payload, '$._image_details') AS image_details
FROM
`httparchive.all.requests` TABLESAMPLE SYSTEM (0.001 PERCENT)
WHERE
date = '2024-01-01' AND
client = 'mobile' AND
is_root_page AND
type = 'image'
LIMIT
10
Sample result:
{
"detected_type": "jpeg",
"metadata": {
"ExifTool": {
"ExifToolVersion": 12.52
},
"File": {
"FileSize": "137 kB",
"FileType": "JPEG",
"FileTypeExtension": "jpg",
"MIMEType": "image/jpeg",
"Comment": "CREATOR: gd-jpeg v1.0 (using IJG JPEG v80), quality = 80\n",
"ImageWidth": 800,
"ImageHeight": 800,
"EncodingProcess": "Baseline DCT, Huffman coding",
"BitsPerSample": 8,
"ColorComponents": 3,
"YCbCrSubSampling": "YCbCr4:2:0 (2 2)"
},
"JFIF": {
"JFIFVersion": 1.01,
"ResolutionUnit": "inches",
"XResolution": 96,
"YResolution": 96
},
"Composite": {
"ImageSize": "800x800",
"Megapixels": 0.64
}
},
"magick": {
"baseName": "10710.94",
"format": "JPEG",
"formatDescription": "JPEG",
"mimeType": "image/jpeg",
"class": "DirectClass",
"geometry": {
"width": 800,
"height": 800,
"x": 0,
"y": 0
},
"resolution": {
"x": 96,
"y": 96
},
"printSize": {
"x": 8.33333,
"y": 8.33333
},
"units": "PixelsPerInch",
"type": "TrueColor",
"baseType": "Undefined",
"endianness": "Undefined",
"colorspace": "sRGB",
"depth": 8,
"baseDepth": 8,
"channelDepth": {
"red": 8,
"green": 8,
"blue": 1
},
"pixels": 1920000,
"imageStatistics": {
"Overall": {
"min": 0,
"max": 255,
"mean": 65.7495,
"median": 35.6667,
"standardDeviation": 79.6716,
"kurtosis": 0.0952899,
"skewness": 1.18423,
"entropy": 0.835339
}
},
"channelStatistics": {
"red": {
"min": 0,
"max": 255,
"mean": 58.5843,
"median": 15,
"standardDeviation": 82.5814,
"kurtosis": 0.438709,
"skewness": 1.4063,
"entropy": 0.805027
},
"green": {
"min": 0,
"max": 255,
"mean": 60.4429,
"median": 30,
"standardDeviation": 76.1642,
"kurtosis": 0.756071,
"skewness": 1.40876,
"entropy": 0.838816
},
"blue": {
"min": 0,
"max": 255,
"mean": 78.2214,
"median": 62,
"standardDeviation": 80.2692,
"kurtosis": -0.494016,
"skewness": 0.80254,
"entropy": 0.862173
}
},
"renderingIntent": "Perceptual",
"gamma": 0.454545,
"chromaticity": {
"redPrimary": {
"x": 0.64,
"y": 0.33
},
"greenPrimary": {
"x": 0.3,
"y": 0.6
},
"bluePrimary": {
"x": 0.15,
"y": 0.06
},
"whitePrimary": {
"x": 0.3127,
"y": 0.329
}
},
"matteColor": "#BDBDBD",
"backgroundColor": "#FFFFFF",
"borderColor": "#DFDFDF",
"transparentColor": "#00000000",
"interlace": "None",
"intensity": "Undefined",
"compose": "Over",
"pageGeometry": {
"width": 800,
"height": 800,
"x": 0,
"y": 0
},
"dispose": "Undefined",
"iterations": 0,
"compression": "JPEG",
"quality": 80,
"orientation": "Undefined",
"properties": {
"comment": "CREATOR: gd-jpeg v1.0 (using IJG JPEG v80), quality = 80\n",
"date:create": "2024-01-13T04:33:02+00:00",
"date:modify": "2024-01-13T04:33:02+00:00",
"date:timestamp": "2024-01-13T04:34:18+00:00",
"jpeg:colorspace": "2",
"jpeg:sampling-factor": "2x2,1x1,1x1",
"signature": "0d0e8995e2aae98c15e1e2bc69c8f988423e022cf4055d72e9752a457a53a440"
},
"tainted": false,
"filesize": "136972B",
"numberPixels": "640000",
"pixelsPerSecond": "40.1272MB",
"userTime": "0.020u",
"elapsedTime": "0:01.015"
}
}
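As a rough illustration of how rows like the one above could be used to spot misleading URLs, the following Python sketch compares the URL's file extension with the `detected_type` field. It assumes each row provides a `url` and the `image_details` JSON string from the query; the helper names and the `ALIASES` table are hypothetical, and only the `"jpeg"` value of `detected_type` is confirmed by the sample.

```python
import json
from typing import Optional, Tuple
from urllib.parse import urlparse

# Hypothetical aliases so that "jpg" vs "jpeg" or "heic" vs "heif" do not count as mismatches.
ALIASES = {"jpg": "jpeg", "heic": "heif"}


def url_extension(url: str) -> str:
    """Return the lowercased extension of the last path segment, or '' if there is none."""
    name = urlparse(url).path.rsplit("/", 1)[-1]
    return name.rsplit(".", 1)[-1].lower() if "." in name else ""


def find_mismatch(url: str, image_details_json: Optional[str]) -> Optional[Tuple[str, str]]:
    """Return (extension, detected_type) when the two disagree, else None."""
    details = json.loads(image_details_json) if image_details_json else None
    if not details:
        return None  # no image details recorded for this request
    detected = details.get("detected_type")
    ext = url_extension(url)
    if not detected or not ext:
        return None  # nothing to compare
    if ALIASES.get(ext, ext) != ALIASES.get(detected.lower(), detected.lower()):
        return (ext, detected)
    return None
```

For a URL ending in `.heic` whose details report `"detected_type": "jpeg"`, `find_mismatch` returns `("heic", "jpeg")`; a `.jpg` URL with the same details returns `None`.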
As per usual, Rick beat me to it. Different (older?) flavor:
Yeah I think here is fine cc @pmeenan
@foolip not sure about venue (if I have a discussion that might require a chattier exploration, I generally start it in the HTTP Archive Slack), but @pmeenan is the person to ask about missing
Interesting chart! I do worry though... the "quality" reported by ImageMagick's
That said... the number of
If you have examples of where it is missing I can take a look. It could happen if for some reason the image response body isn't available or the code that detects the image type by looking at the header bytes doesn't recognize it. HEIF is definitely not detected but the others should be reasonably up to date.
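For readers unfamiliar with header-byte detection, here is a small Python sketch of the general technique. It is not the crawler's actual detection code; the function name `sniff_image_type` and the set of formats covered are assumptions, and the signatures used are the widely documented magic numbers for each format.

```python
from typing import Optional


def sniff_image_type(head: bytes) -> Optional[str]:
    """Guess an image type from the first bytes of a response body.

    Returns None when the bytes match none of the signatures below,
    which is how an unrecognized format would end up without a type.
    """
    if head.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if head.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if head[:6] in (b"GIF87a", b"GIF89a"):
        return "gif"
    if head[:4] == b"RIFF" and head[8:12] == b"WEBP":
        return "webp"
    # ISO base media files: a box size, then "ftyp", then the major brand.
    # Only the major brand is checked here, so files that declare AVIF or
    # HEIC only in their compatible-brands list are not told apart.
    if head[4:8] == b"ftyp":
        brand = head[8:12]
        if brand in (b"avif", b"avis"):
            return "avif"
        if brand in (b"heic", b"heix", b"mif1"):
            return "heif"
    return None
```

A file served from a `.heic` URL that starts with `FF D8 FF` would be reported as `"jpeg"` by this kind of check, since the URL never enters into it.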
I have been poking at https://github.com/HTTPArchive/almanac.httparchive.org/blob/main/sql/2022/media/bytes_and_dimensions_by_format.sql to get an updated view of quality distributions in the wild.
I happened to look for 'heif' images and was surprised how many I found. It turns out that for example https://gaijincph.dk/ serves https://ftstorageprod.blob.core.windows.net/images/restaurant/5ad12db8/images/1e59c431-1198-4859-bebd-769d37d1a975_m.heic
That gets classified by pithyType() as 'heif' because the URL ends with '.heic'. However, it's actually a JPEG.
It seems like only the URL is used in fact, because of this call here:
almanac.httparchive.org/sql/2022/media/bytes_and_dimensions_by_format.sql
Line 112 in ff9fd22
There is no mimeType in the data, at least not in the httparchive.pages.2024_01_01_desktop data. Here's what I unpacked from payload and a few nested JSON objects for https://gaijincph.dk/:
Since the number of bytes and the decoded width and height are known, the decoder that was actually used should in principle be knowable.