
Suggestion - Post library download statistics/analytics numbers & graphs #1078

Open
Jakobud opened this issue Mar 22, 2013 · 35 comments

@Jakobud

Jakobud commented Mar 22, 2013

Does CDNJS keep track of how many times individual files are pulled from its CDN? It would be absolutely awesome if CDNJS did this and then had up-to-date charts and data tables showing how many downloads each file is getting.

This would be epic because it would help developers choose which library version they want their users downloading. The developer would obviously want a version that is compatible with their code, but among those, they would choose the one that has been downloaded the most over the past X amount of time.

For example, let's say the latest jQuery just gets released and put on CDNJS. A couple of days pass and the stats for jQuery look like this for the past week:

jQuery 1.9.1 = 20,000 downloads
jQuery 1.9.0 = 50,000 downloads
jQuery 1.8.3 = 560,000 downloads
jQuery 1.8.2 = 120,000 downloads
etc...

The developer can look at this and know that their visitors are more likely to already have jQuery 1.8.3 cached, as opposed to 1.9.1, since 1.9.1 is new. So as long as their code is 1.8.3-compatible, they would choose that one.

And since these numbers change over time, maybe a month later the developer comes back to CDNJS and sees that the 1.9.1 stats are now higher than 1.8.3's. So again, as long as their code is 1.9.1-compliant, they could safely switch their site to 1.9.1, since their visitors are now more likely to already have it cached.

Does this make sense? To me it would be EXTREMELY useful. The whole point of CDNJS is for developers to share libraries and resources. So over time, as more and more libraries and versions of those libraries are added to CDNJS, a tool like this would be invaluable for developers to make informed decisions based on which libraries and resources are being shared the most.



@ryankirkman
Member

@Jakobud Great suggestion Jake. You're absolutely right that this would be really useful, and it is a popular request: #405

We're brainstorming solutions right now, so we're glad to have you as part of the conversation.

@Lockyc
Contributor

Lockyc commented Mar 23, 2013

Closed old issue #405; continue the conversation here.

@thomasdavis
Member

Tagged as high priority, anyone have any brilliant ideas yet on how to parse a few billion lines?

@Jakobud
Author

Jakobud commented Jun 27, 2013

How many lines is the typical log file? Do you split the log files up to one per day or smaller? Do the log files simply say what http://path/file was downloaded? Or do they reference database row IDs (IDs of each filename, which I assume are stored in a database)?

@ryankirkman
Member

Each edge location (currently 23) is treated independently of every other.

So what we have is one or more log files per edge location per day, and we
are getting a significant number of hits.


@Jakobud
Author

Jakobud commented Jun 28, 2013

If you could post excerpts of the log files, that would be a place to start.

@Jakobud
Author

Jakobud commented Jul 31, 2013

Any progress on this? You guys need any help with it? I know there are probably a lot of huge log files, but I think it would only be a matter of a simple Python script that streamed in the log files and saved the data out to a database or something like that. It would be a long-running process, but it probably wouldn't be that complicated really.
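To make that concrete, here's a minimal sketch of such a script. It assumes the logs are in Common Log Format and that request paths follow the cdnjs `/ajax/libs/<name>/<version>/` layout; the real log format would need to be confirmed first.

```python
import re
from collections import Counter

# Request-line pattern for Common Log Format entries. The
# /ajax/libs/<name>/<version>/ path layout mirrors cdnjs URLs,
# but this is an assumption about what the logs contain.
REQUEST = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[\d.]+"')
LIB_PATH = re.compile(r'^/ajax/libs/(?P<lib>[^/]+)/(?P<ver>[^/]+)/')

def count_downloads(lines):
    """Stream log lines and tally hits per (library, version)."""
    counts = Counter()
    for line in lines:
        req = REQUEST.search(line)
        if not req:
            continue
        lib = LIB_PATH.match(req.group('path'))
        if lib:
            counts[(lib.group('lib'), lib.group('ver'))] += 1
    return counts

sample = [
    '1.2.3.4 - - [22/Mar/2013:10:00:00 +0000] "GET /ajax/libs/jquery/1.8.3/jquery.min.js HTTP/1.1" 200 93636',
    '1.2.3.5 - - [22/Mar/2013:10:00:01 +0000] "GET /ajax/libs/jquery/1.9.1/jquery.min.js HTTP/1.1" 200 92629',
    '1.2.3.6 - - [22/Mar/2013:10:00:02 +0000] "GET /ajax/libs/jquery/1.8.3/jquery.min.js HTTP/1.1" 200 93636',
]
stats = count_downloads(sample)
# stats == {('jquery', '1.8.3'): 2, ('jquery', '1.9.1'): 1}
```

Since this streams line by line, memory stays flat no matter how big the log files get; the counter could be flushed to a database periodically.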

@Jakobud
Author

Jakobud commented Aug 7, 2014

FYI, I don't know if cdnjs utilizes AWS services on the backend or not, but this is an interesting article that is potentially very relevant to this issue:

http://aws.amazon.com/blogs/aws/all-your-data-fluentd/

It discusses using software called Fluentd to stream logfile changes into data storage. So for CDNJS, it could stream library access logs into some sort of usage database that could be used to display usage statistics.

@Jakobud
Author

Jakobud commented Aug 7, 2014

Also, FYI, you guys could get someone to help you with a solution for this if you could divulge details about your logging: how it works, where the files are stored, give us access to a day's or week's worth of logs, etc. Someone could figure out a solution for you.

@Jakobud
Author

Jakobud commented Aug 7, 2014

Another suggestion for you guys: just make your logs public. Put them up on AWS S3 or something and allow anyone to grab them. I GUARANTEE someone (probably multiple people) will come up with an analytics solution for you.

@Jakobud
Author

Jakobud commented Dec 11, 2014

Just wanted to reach out regarding this issue again. I'll say it again: provide some example log files and someone somewhere will put together a parser for you that pulls library download stats.

@PeterDaveHello
Contributor

ping @thomasdavis

@IonicaBizau
Contributor

Creating an API service for cdnjs would be nice. Something like:

api.cdnjs.com/lib/jquery/stats

Then we could use this service to fetch the stats on the cdnjs website. 🍀
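As a sketch of what that could look like: the endpoint above doesn't exist yet, so the response below is entirely hypothetical (both the URL and the JSON shape are assumptions), but a consumer could pick the most-cached version from it like this:

```python
import json

# Hypothetical response body from api.cdnjs.com/lib/jquery/stats.
# The endpoint and JSON shape are assumptions, not an existing API;
# the numbers echo the example earlier in this thread.
payload = json.loads("""
{
  "library": "jquery",
  "period": "7d",
  "downloads": {"1.9.1": 20000, "1.9.0": 50000,
                "1.8.3": 560000, "1.8.2": 120000}
}
""")

def most_cached_version(stats):
    """Return the version with the most downloads, i.e. the one
    visitors are most likely to already have cached."""
    return max(stats["downloads"].items(), key=lambda kv: kv[1])[0]

print(most_cached_version(payload))  # 1.8.3
```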

@PeterDaveHello
Contributor

Stats from the website are easy, but people want stats from the CDN. I remember that Cloudflare didn't give us that info or access logs.

cc @thomasdavis @ryankirkman @terinjokes

@ryankirkman
Member

We can get access to the logs, but the log volume is so large that we need to figure out an aggregation strategy.

@davidbau
Contributor

Approximate stats would be nearly as good. If log volume is a problem, logs could be sampled.
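A quick sketch of how sampling could work: keep each log line with probability `rate` and scale the kept tallies by `1/rate`, which gives approximately unbiased totals while only a fraction of lines need full parsing and storage. (The paths here are synthetic; in practice the sampling would happen at the edge, before lines are ever shipped.)

```python
import random
from collections import Counter

def estimate_counts(paths, rate, seed=0):
    """Approximate per-path hit counts by keeping each line with
    probability `rate` and scaling the kept tallies by 1/rate."""
    rng = random.Random(seed)
    kept = Counter(p for p in paths if rng.random() < rate)
    return {path: round(n / rate) for path, n in kept.items()}

# 100,000 synthetic hits: 90% to one version, 10% to another.
hits = ['/jquery/1.8.3'] * 90_000 + ['/jquery/1.9.1'] * 10_000
est = estimate_counts(hits, rate=0.01)
# est should land within a few percent of the true 90,000 / 10,000
# split, even though only ~1,000 lines were kept.
```

The relative error shrinks as volume grows, which is exactly the regime where parsing every line is too expensive.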

@thomasdavis
Member

That is true! Even one day of traffic * 30 would be interesting enough.

@Jakobud
Author

Jakobud commented May 26, 2015

Where are the logs now? Are they accessible in any form? I would think dumping daily logs on some S3 storage would be feasible and then someone could write something that parses them.

@IonicaBizau
Contributor

I would be excited to write a tool to parse the logs! I'm involved in some statistics & visualization projects anyway, so that would be awesome. 🎇

@Jakobud
Author

Jakobud commented May 26, 2015

Like I said before, all CDNJS needs to do is make the logs accessible in some form, and someone will step up to write a cool parser to generate usage stats.

@PeterDaveHello
Contributor

We are working on it now; the IP addresses in the logs are sensitive, so we should be careful.

@fj

fj commented Jul 13, 2015

Any update on this? Throwing my hat in the ring as another person who'd be willing to write a parser.

@PeterDaveHello
Contributor

Hey all, I'm afraid not; there are some more important issues, but we will try our best to have this feature ASAP.

@PeterDaveHello
Contributor

BTW, thanks to everyone who wants to write a parser for us. If you don't mind, you can still contribute to other parts of cdnjs, like the Bower auto-updater or something. Thanks!

@Jakobud
Author

Jakobud commented Nov 19, 2015

Any more updates on this one? It's been over 2 1/2 years. Have you guys just considered making your logs publicly accessible in some form?

Help us help you!

@PeterDaveHello
Contributor

ping @thomasdavis @ryankirkman @terinjokes @drewfreyling ...

@Jakobud
Author

Jakobud commented Nov 19, 2015

Hey, so I know that back on #405 the issue was money. The logs are in Common Log Format; however, pulling down the logs for 5 million hits is $300 per day or something like that. (2 1/2 years later, you guys probably get WAY more than 5 million hits a day.)

So the solution thrown out there was to set up a parser on an EC2 instance. This would be the best solution. As long as your EC2 instance is in the same region as your S3 bucket, there is no cost to transfer your log files from S3 to the EC2 instance.

So essentially, the solution would be a daily task along these lines:

  1. EC2 instance starts up
  2. Script pulls logs for last 24 hours from S3 container
  3. Script parses logs
  4. Script deletes local log
  5. Script dumps the data in whatever form you want into some database somewhere
  6. Script terminates EC2 instance

So the cost would be absolutely minimal. You would only pay for the time the instance is active. Scheduling an EC2 instance to start every 24 hours shouldn't be too hard, and I'm pretty sure you can terminate an EC2 instance programmatically from within itself.

Just a thought. It honestly wouldn't be too terribly difficult to figure out...
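Steps 3-5 of the daily task above could be as simple as aggregating into a table keyed by day. Here's a sketch using SQLite as a stand-in for "some database somewhere"; the schema is made up, and the S3 download and instance start/stop steps are omitted since they depend on the actual AWS setup.

```python
import sqlite3
from collections import Counter

def store_daily_counts(con, day, hits):
    """Steps 3-5 of the daily task: aggregate one day of parsed hits
    and upsert them into a downloads table. `hits` is an iterable of
    (library, version) tuples pulled out of the day's logs; the raw
    log can be deleted once this commits."""
    counts = Counter(hits)
    con.execute("""CREATE TABLE IF NOT EXISTS downloads (
                       day TEXT, library TEXT, version TEXT, hits INTEGER,
                       PRIMARY KEY (day, library, version))""")
    con.executemany(
        "INSERT OR REPLACE INTO downloads VALUES (?, ?, ?, ?)",
        [(day, lib, ver, n) for (lib, ver), n in counts.items()])
    con.commit()

con = sqlite3.connect(":memory:")  # stand-in for the real stats DB
store_daily_counts(con, "2015-11-19",
                   [("jquery", "1.8.3"), ("jquery", "1.8.3"),
                    ("jquery", "1.9.1")])
rows = list(con.execute(
    "SELECT library, version, hits FROM downloads ORDER BY hits DESC"))
# rows == [('jquery', '1.8.3', 2), ('jquery', '1.9.1', 1)]
```

The composite primary key makes re-running a day's job idempotent, which matters if the scheduled task ever retries.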

@Jakobud
Author

Jakobud commented Nov 19, 2015

Actually an even better solution would be using AWS Data Pipeline

http://aws.amazon.com/documentation/data-pipeline/

And AWS Elastic Map Reduce

https://aws.amazon.com/elasticmapreduce/

Those tools are made to do exactly what you guys need to do: analyze data/logs in a cost-efficient manner.

@ryankirkman
Member

Hi Jake,

The solution you proposed is very elegant, but unfortunately we don't use CloudFront for hosting the CDN anymore. Cloudflare is the primary network provider.

As for a stats solution, we don't have a good answer yet, sorry Jake.

@PeterDaveHello
Contributor

@ryankirkman can we estimate the disk space we need per day? Maybe I can find the storage.

@Jakobud
Author

Jakobud commented Nov 19, 2015

Are Cloudflare logs accessible to you in some form, downloadable or via an API or anything? Also, EC2 transfer pricing:

Data Transfer IN To Amazon EC2 From Internet $0.00 per GB

https://aws.amazon.com/ec2/pricing/

So I assume that means you could programmatically pull in Cloudflare logs and parse them or do whatever, and it would still only cost you for the time the EC2 instance is active.

@dazbradbury

Looks like this issue has been pretty stagnant. Is there now an alternative / feasible solution for determining library usage stats or percentages?

Taking the jQuery example: as a site owner you care about the % of users arriving with the required jQuery version already cached, and any stats cdnjs can provide would be awesome in determining that.

@MattIPv4
Member

Currently waiting on Cloudflare to establish a way for us to have stats/log access for the cdnjs.cloudflare.com domain. Will post updates as I get them.

@MattIPv4
Member

Noted from #6186 that more in-depth stats would be useful such as country breakdowns.

@MattIPv4
Member

MattIPv4 commented Jun 6, 2019

@dknecht Please can we use this issue to track any updates on further stats/log access for the cdnjs.cloudflare.com domain. Thanks :)

14 participants