Streaming of sequence elements

raboof · October 2018

If I have a slow, I/O-bound sequence (IEnumerable<T>), I have to wait until the entire sequence has been enumerated before LINQPad displays it in the Results pane. Is there a reason LINQPad doesn't display elements in a streaming fashion, as they are yielded by the enumerator? It seems to do this for observables (IObservable<T>).

JoeAlbahari · October 2018

LINQPad assumes that IEnumerable sequences are synchronous and so it displays the results in one go as an optimization (doing so is more efficient than streaming the elements if all the elements arrive together).

What sort of sequence do you have that's IEnumerable and streaming? These are quite rare. With most I/O-bound IEnumerables, the blocking happens before the first element arrives, and then everything else arrives quickly after that.

Note that LINQPad also streams IAsyncEnumerable. So you can make an IEnumerable display asynchronously by referencing a library that contains an IAsyncEnumerable implementation and calling .ToAsyncEnumerable on it (or writing an extension method to do that).
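A minimal sketch of such a conversion, assuming C# 8 or later and the built-in IAsyncEnumerable<T>. The name AsStreaming is hypothetical and not part of LINQPad or any library. It is written as a plain static method here; placing it in a static class and adding `this` to the parameter would turn it into an extension method. Each MoveNext call is pushed to the thread pool so the consumer awaits rather than blocks while the source waits on I/O.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper (not a LINQPad API): exposes a blocking sequence as an async one.
// Every MoveNext call runs on the thread pool, so the consumer awaits instead of blocking.
static async IAsyncEnumerable<T> AsStreaming<T>(IEnumerable<T> source)
{
    using var enumerator = source.GetEnumerator();
    while (await Task.Run(() => enumerator.MoveNext()))
        yield return enumerator.Current;
}

// Illustrative slow source: each element takes about 50 ms to produce.
static IEnumerable<int> SlowNumbers()
{
    for (var i = 1; i <= 3; i++)
    {
        Thread.Sleep(50);
        yield return i;
    }
}

var received = new List<int>();
await foreach (var n in AsStreaming(SlowNumbers()))
{
    received.Add(n);
    Console.WriteLine(n); // elements are printed one by one as they arrive
}
```

In a LINQPad query, the async sequence would be dumped directly instead of being consumed with await foreach.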

raboof · October 2018

Sequence enumeration is usually designed to be lazy and to stream the results, whenever possible. Most (though not all) LINQ operations are designed that way. Isn't it only natural then to expect that the elements will be displayed as they are yielded?

> What sort of sequence do you have that's IEnumerable and streaming?

The following example is only for illustration but it is representative of the problem:


using (var http = new HttpClient())
{
    var news =
        from urls in new[]
        {
            from Match m in
                Regex.Matches(await http.GetStringAsync("https://cnn.com/"),
                    // anchor tags: group "n" captures attribute names, group "v" their values
                    @"<a(\s+(?<n>[a-z]+)=(""(?<v>.+?)""|'(?<v>.+?)'))+")
            select new[] { "n", "v" }.Select(g => from Capture c in m.Groups[g].Captures select c.Value)
                                     .Fold((ns, vs) => ns.Zip(vs, (n, v) => new { Name = n, Value = v }))
                                     .FirstOrDefault(a => "href".Equals(a.Name, StringComparison.OrdinalIgnoreCase))?.Value
            into href
            where href != null
            select new Uri(new Uri("https://cnn.com/"), href) into url
            where Regex.IsMatch(url.AbsolutePath, @"^/(?x: world
                                                         | politics
                                                         | business
                                                         | entertainment
                                                         | sport
                                                         | travel
                                                         | style
                                                         | health)$")
            select url
        }
        from e in urls.Distinct().Await(async (url, ct) => new
        {
            Url     = url,
            Content = await (await http.SendAsync(new HttpRequestMessage(HttpMethod.Get, url), ct))
                                       .Content.ReadAsStringAsync(),
        })
        select new
        {
            e.Url,
            Title = Regex.Match(e.Content, @"(?<=<title( .+?)?>).+?(?=</title>)").Value
        };

    news.Dump();
}

I have also uploaded the query if you want to take it for a spin. It uses Await from MoreLINQ but that's irrelevant (although I've also shared a version without).

Another very simple example would be converting any observable to a sequence via ToEnumerable().

> doing so is more efficient than streaming the elements if all the elements arrive together

I don't understand how this matters when we are talking only about display.

> Note that LINQPad also streams IAsyncEnumerable.

Yep, I am aware of that option, as well as of going from sequences to observables, but I don't see why people should be forced down that route when it isn't needed. Your main thread is usually synchronous and blocking.

I want to be able to write an expression query and see the results stream to the Results pane as they arrive. I have to work around this by turning the query into C# statements and using a foreach loop just to get the right effect.
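For illustration, the workaround looks roughly like the sketch below. SlowQuery is a hypothetical stand-in for an I/O-bound query, and Console.WriteLine stands in for LINQPad's Dump so the snippet runs outside LINQPad.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Illustrative stand-in for an I/O-bound query; the name and the delay are assumptions.
static IEnumerable<string> SlowQuery()
{
    foreach (var name in new[] { "world", "politics", "business" })
    {
        Thread.Sleep(100); // simulate waiting on I/O
        yield return name;
    }
}

var shown = new List<string>();
foreach (var row in SlowQuery())
{
    Console.WriteLine(row); // in a LINQPad query this line would be row.Dump();
    shown.Add(row);         // kept only so the result can be inspected afterwards
}
```

The cost of this form is that each row is dumped on its own instead of appearing as a row of a single result table.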

Streaming is even more crucial when running on the command-line, as in lprun -format=csv query.linq.

JoeAlbahari · October 2018

Would you also expect the results to stream if the sequence is contained within another object or sequence?

For instance:

new { X = somesequence }.Dump();

raboof · October 2018

Yes, in the interactive mode, when using LINQPad. Eventually you have to enumerate the sequence to display it, and today that happens synchronously, albeit in a buffered fashion, so why not let that stream out too? Why make an artificial exception or boundary? In effect, you'd see the dump “build out”. I reckon that most sequences will be fast, so it won't change anything from today.

For LPRun, however, one can't expect sub-sequences to be streamed out, especially when using the CSV format (to point out the extreme case), because the structure would have to make sense as a whole. At the same time, if the root is a sequence then its items should be streamed as they are yielded. This is important for extremely large results. Imagine a query that scans files of hourly data and summarizes them to daily figures via grouping, all as a single sequence. If you have thousands of such files, you don't want to have to wait, or to re-model the query as an asynchronous sequence, just to get streaming.

Just to be clear, if you have this:

new { X = slowSequence1,
      Y = slowSequence2 }.Dump();

then there's no expectation that X and Y are streamed out at the same time, even interactively in LINQPad. X would stream first and then dumping would proceed to Y.

Now, depending on how fancy you want to get, it's up to you to decide whether you want to detect a “slow sequence” and switch from buffering to streaming. For example, one could define a tolerance of 500 ms: if the sequence isn't done by then, switch to streaming. It would be a reasonable compromise if you want to maintain today's behaviour of assuming sequences are generally fast, but I think this isn't necessary and feels like overkill.
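A rough sketch of that tolerance idea, purely as an illustration and not a description of how LINQPad works. EmitAdaptively and its emit callback are hypothetical names. Note one simplification: the clock is only checked when an item arrives, so a single MoveNext call that blocks for a long time still delays the switch.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading;

// Hypothetical sketch: buffer items while the sequence is within the tolerance,
// then flush the buffer and emit every later item as soon as it arrives.
static void EmitAdaptively<T>(IEnumerable<T> source, TimeSpan tolerance,
                              Action<IReadOnlyList<T>> emit)
{
    var buffer = new List<T>();
    var clock = Stopwatch.StartNew();
    var streaming = false;

    foreach (var item in source)
    {
        if (!streaming && clock.Elapsed > tolerance)
        {
            streaming = true;                       // too slow: switch modes
            if (buffer.Count > 0) emit(buffer.ToArray());
            buffer.Clear();
        }

        if (streaming) emit(new[] { item });        // one item per batch
        else buffer.Add(item);
    }

    if (buffer.Count > 0) emit(buffer);             // fast sequence: a single batch
}

// Illustrative slow source: about 30 ms per element.
static IEnumerable<int> Slow()
{
    for (var i = 0; i < 3; i++) { Thread.Sleep(30); yield return i; }
}

var fastBatches = new List<int>();
EmitAdaptively(Enumerable.Range(1, 5), TimeSpan.FromMilliseconds(500),
               batch => fastBatches.Add(batch.Count));

var slowBatches = new List<int>();
EmitAdaptively(Slow(), TimeSpan.FromMilliseconds(10),
               batch => slowBatches.Add(batch.Count));

Console.WriteLine(string.Join(",", fastBatches)); // batch sizes for the fast sequence
Console.WriteLine(string.Join(",", slowBatches)); // batch sizes for the slow sequence
```

The fast sequence should come out as one batch, as it does today, while the slow one should come out item by item.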

By the way, any query using a sequence from Directory.EnumerateFileSystemEntries that operates recursively through a tree is going to be slow, so I think slow sequences are far more common than one would like to think.

JoeAlbahari · October 2018

I've looked into how this might be implemented, but so far I can't find a solution that isn't prohibitively difficult. To see why, run the following query and expand the result by pressing Alt+0:

System.Globalization.CultureInfo.CurrentCulture

There are upwards of 100 sequences in that object graph (they are all arrays, but let's imagine they're IEnumerable sequences).

Right now this is rendered in a single operation, in a single round-trip. Your query calls Dump which converts the object graph into a graph of meta-nodes, then visits them with an HTML writer (or JSON writer if you choose Text results in LPRun) and sends the HTML to the host process to render. If LINQPad rendered each sequence lazily, this architecture would have to be completely redesigned.

Another problem is that your query process would potentially need to make hundreds of round-trips to the host, splicing in another piece of HTML via the browser DOM after each sequence. This would cause a noticeable delay and flicker. To avoid this, it would need to batch together the HTML updates that occur in rapid succession, which in itself is not hard (it does this anyway with observables and IAsyncEnumerable). What's hard is converting this batch into efficient browser DOM actions, because now you're not just adding rows to a table but performing operations that alter the DOM at multiple levels.

I think it would be achievable for top-level dumps, but I'm not sure how useful that would be.

raboof · October 2018

> I think it would be achievable for top-level dumps, but I'm not sure how useful that would be.

It is very useful (for all the reasons I listed earlier) and a good compromise. Without this, the CSV format is fairly useless except for the fastest queries, with no I/O-bound operations.

If I have an expression query that yields hundreds of thousands of rows from I/O-bound operations, then buffering makes things like the following useless:

PS> lprun -format=csv test.linq | select -First 10

One can always buffer or stream based on the actual run-time type. If it's strictly IEnumerable<T> then stream, but as in the case of a full CultureInfo dump, most objects will be arrays, lists or collections, and so those could be rendered in a single round-trip, like today.
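A minimal sketch of such a run-time check, under the assumption that anything exposing a collection interface is already materialized. IsMaterialized is a hypothetical name and says nothing about how LINQPad actually decides.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

// Hypothetical check: a sequence whose run-time type implements a collection
// interface (arrays, lists, sets, ...) is assumed to be in memory and can be
// rendered in one go; anything else is treated as lazy and a candidate for streaming.
static bool IsMaterialized(IEnumerable sequence) =>
    sequence is ICollection
    || sequence.GetType().GetInterfaces().Any(i =>
           i.IsGenericType
           && (i.GetGenericTypeDefinition() == typeof(ICollection<>)
               || i.GetGenericTypeDefinition() == typeof(IReadOnlyCollection<>)));

// A compiler-generated iterator implements only the enumerable/enumerator interfaces.
static IEnumerable<int> Lazy() { yield return 1; yield return 2; }

Console.WriteLine(IsMaterialized(new[] { 1, 2, 3 }));         // array: buffer
Console.WriteLine(IsMaterialized(new HashSet<int> { 1, 2 })); // set: buffer
Console.WriteLine(IsMaterialized(Lazy()));                    // iterator: stream
```

The check looks at the run-time type rather than the declared type, so an array passed around as IEnumerable<T> would still be rendered in one go.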

The HTML and JSON cases are different. Unlike CSV, which represents unbounded tabular data, they are documents with a root. One wouldn't naturally expect that the document is streamed out (unless you're supporting advanced stream-based document processing upstream; thinking XSLT here).

BTW, I'm not sure how the efficiency problems you've mentioned are any different for observables and asynchronous sequences.
