Writing large object to file is slow #43

borkdude · 2019-03-29T19:42:25Z

Problem:

Writing a large object to a file with transit-clj was slower than I expected. It takes about 1 second with transit-clj and roughly 1/10th of that time with regular spit.

Test data:

test.edn.zip
Extract to test.edn.

Repro:

clj -Sdeps '{:deps {com.cognitect/transit-clj {:mvn/version "0.8.313"}}}'
(require '[clojure.edn :as edn])
(def edn (edn/read-string (slurp "test.edn")))
(count (keys edn)) ;;=> 799
(require '[cognitect.transit :as transit])
(require '[clojure.java.io :as io])
(def writer (transit/writer (io/output-stream (io/file "transit.json")) :json))
(time (transit/write writer edn)) ;; prints:
"Elapsed time: 1151.116438 msecs"

Whereas writing the same EDN to a file (printing with str) is much faster:

(time (with-open [fos (java.io.FileOutputStream. "/tmp/foo.edn")] (let [w (io/writer fos)] (.write w (str edn)) (.flush fos)))) ;; prints:
"Elapsed time: 73.316135 msecs"

Flamegraphs

Created with https://github.com/clojure-goes-fast/clj-async-profiler

Flamegraph of transit:
transit-flamegraph.svg.zip

Flamegraph of EDN to file:
edn-flamegraph.svg.zip

strssndktn · 2020-07-03T16:52:01Z

@borkdude I investigated this a bit, and it seems that transit calls .flush overly aggressively, forcing the data to be flushed out to disk on every element (I am not completely sure how and when .flush gets called, or what the reasons for that are). As an experiment, could you put a proxy between your output-stream and transit/writer that overrides .flush with a no-op function?

I think it makes sense if transit-java calls .flush only once for each top-level write.

borkdude · 2020-07-03T19:12:09Z

@strssndktn I worked around this by first writing to a ByteArrayOutputStream for which .flush is a no-op and that helped a lot.
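
For illustration, the workaround described above might look like the following minimal sketch. It assumes the same transit-clj dependency as the repro; `write-transit-via-buffer` is a hypothetical name, not code taken from any project mentioned in this thread.

```clojure
(require '[cognitect.transit :as transit]
         '[clojure.java.io :as io])
(import '(java.io ByteArrayOutputStream OutputStream))

(defn write-transit-via-buffer
  "Hypothetical helper: encodes `value` as transit JSON into an in-memory
  buffer first, then copies the buffer to `file` in one go. Flushing a
  ByteArrayOutputStream does nothing, so transit's flush calls cost nothing."
  [file value]
  (let [baos (ByteArrayOutputStream.)]
    (transit/write (transit/writer baos :json) value)
    (with-open [^OutputStream out (io/output-stream (io/file file))]
      (.writeTo baos out))))

;; Usage, with `edn` as defined in the repro:
;; (time (write-transit-via-buffer "transit.json" edn))
```

The trade-off is that the entire encoded output sits in memory before anything is written to disk.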

RokLenarcic · 2023-05-23T11:27:45Z

@borkdude writing to a BAOS first requires holding the whole output in memory, which is undesirable. Instead, you can try wrapping the output stream in your own OutputStream subclass that makes .flush a no-op.

tonsky · 2023-07-21T18:06:09Z

I’m hitting the same issue. This did the trick for me:

(defn no-flush-output-stream [^OutputStream os]
  (proxy [BufferedOutputStream] [os]
    (flush [])
    (close []
      (.flush os)
      (.close os))))

tonsky · 2023-07-25T18:34:55Z

UPD: that actually should’ve been

(defn no-flush-output-stream [^OutputStream os]
  (proxy [BufferedOutputStream] [os]
    (flush [])
    (close []
      (proxy-super flush)
      (proxy-super close))))

Works like a charm

ericdallo · 2023-07-27T00:55:10Z

@tonsky I just updated clojure-lsp to use your suggestion, and it did indeed improve memory usage for huge maps. Thanks for that!

borkdude · 2023-07-27T09:46:43Z

Thanks. I needed to add a couple more type hints to make it work without reflection:

(defn no-flush-output-stream
  "See https://github.com/cognitect/transit-clj/issues/43#issuecomment-1650341353"
  ^java.io.OutputStream [^java.io.OutputStream os]
  (proxy [java.io.BufferedOutputStream] [os]
    (flush [])
    (close []
      (let [^java.io.BufferedOutputStream this this]
        (proxy-super flush)
        (proxy-super close)))))
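
One way to check how such a wrapper behaves, without involving transit at all, is to put a flush-counting stream underneath it. The sketch below repeats the wrapper so that it is self-contained; `counting-sink` and `flush-demo` are hypothetical helpers written for this illustration only.

```clojure
(import '(java.io BufferedOutputStream ByteArrayOutputStream OutputStream))

(defn no-flush-output-stream
  "Same wrapper as above: buffers writes and ignores flush until close."
  ^OutputStream [^OutputStream os]
  (proxy [BufferedOutputStream] [os]
    (flush [])
    (close []
      (let [^BufferedOutputStream this this]
        (proxy-super flush)
        (proxy-super close)))))

(defn counting-sink
  "Hypothetical helper: an in-memory stream that increments the atom
  `flushes` every time flush is called on it."
  ^ByteArrayOutputStream [flushes]
  (proxy [ByteArrayOutputStream] []
    (flush [] (swap! flushes inc))))

(defn flush-demo
  "Writes `n` bytes through the stream returned by `(wrap sink)`, calling
  flush after every byte, and reports what had reached the sink before and
  after close."
  [wrap n]
  (let [flushes (atom 0)
        sink    (counting-sink flushes)
        ^OutputStream out (wrap sink)]
    (dotimes [i n]
      (.write out (int i))
      (.flush out))
    (let [before-close {:flushes @flushes :bytes (.size sink)}]
      (.close out)
      {:before-close before-close
       :after-close  {:flushes @flushes :bytes (.size sink)}})))

;; Plain buffering: every flush call is passed down to the sink.
(flush-demo #(BufferedOutputStream. %) 10)

;; With the wrapper: nothing reaches the sink until close.
(flush-demo no-flush-output-stream 10)
```

Comparing the two results shows whether the per-element flush calls are still reaching the underlying stream.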

tonsky · 2023-07-27T14:55:36Z

Did the exact same change yesterday myself! Look at that, two senior Clojure devs got 10-line function right after just 3 attempts!

timewald · 2023-07-28T15:00:17Z

On Fri, Jul 3, 2020 at 12:52 PM strssndktn wrote: @borkdude I investigated this a bit and it seems that transit calls .flush overly aggressively, thus forcing the data to be flushed out to disk on every element (I am not completely sure how and when .flush gets called or what are the reasons for that)

The reason flush gets called so often is that the original use case of transit was for a wire protocol. In that mode, we wanted to flush each entity to the network when it was done being written. This is the same reason that the underlying transit reader does not assume that a single read gets you to the end of a stream. This could potentially be made an option that gets passed in when you create a writer. If memory serves me, it would have to flow down to the Java implementation. I am not advocating for any change, just wanted to answer the question about why it is the way it is. I am glad you found a workaround that works for you. Tim-

tonsky · 2023-07-28T15:29:16Z

Well, the problem is not that transit flushes at the end of writing an entity. The problem is that it flushes multiple times in the middle of writing an entity. I don’t think that could serve any purpose; it must’ve been an implementation bug.

timewald · 2023-07-28T16:04:22Z

It flushes for any complex entity it writes, including those that are nested. So if you have an array of 100 maps, it will flush 101 times - once for each map and once for the array. From a network streaming perspective, I think that's preferable, but I can see the other perspective for sure.

tonsky · 2023-07-28T16:57:28Z

Why is it preferable from a network perspective? All these maps are wrapped in a list, and you can’t read part of the list on the other side, can you?

timewald · 2023-07-28T17:21:30Z

You can start the process of reading on the other side. The Jackson parser is stream oriented, so while the transit reader won't produce a result until, say, an entire array has arrived, it can be reading and decoding the bytes while the writer is still producing output.

tonsky · 2023-07-28T17:33:20Z

Sure, but that’s why you wrap your OutputStream in a BufferedOutputStream. It’ll buffer enough for you and flush itself when the buffer fills up.
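
The self-flushing behaviour of BufferedOutputStream can be seen in a small sketch that uses only java.io. `buffered-write-demo` is a hypothetical helper, and the tiny buffer size is chosen purely to make the effect visible.

```clojure
(import '(java.io BufferedOutputStream ByteArrayOutputStream))

(defn buffered-write-demo
  "Writes `n` single bytes through a BufferedOutputStream with a
  `buffer-size`-byte buffer, never calling flush, and reports how many bytes
  had reached the underlying stream before and after close."
  [buffer-size n]
  (let [sink (ByteArrayOutputStream.)
        out  (BufferedOutputStream. sink (int buffer-size))]
    (dotimes [i n]
      (.write out (int i)))
    (let [before (.size sink)]
      (.close out)
      {:before-close before
       :after-close  (.size sink)})))

;; 20 bytes through an 8-byte buffer: part of the data is pushed to the
;; sink as the buffer fills, and the remainder arrives on close.
(buffered-write-demo 8 20)
```

So a reader on the other end can still start consuming data early, even if the writer never calls flush explicitly.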

timewald · 2023-07-28T19:53:46Z

Yeah we could have done that too. I can't honestly remember at this point if we considered it or not. :)

SevereOverfl0w mentioned this issue Mar 4, 2020

When using with HTTP Chunked Transfer Encoding, 6.7MB becomes 10.5MB due to chunking overheads #46

Open

strssndktn mentioned this issue Jul 8, 2020

flush behavior during packing cognitect/transit-java#33

Open

borkdude mentioned this issue Jul 27, 2023

Use no-flush-outputstream rather than byte array outputstream clj-kondo/clj-kondo#2147

Merged
