Large Text File

Greetings
I want to access and navigate a large text file. The file is about 40,000 lines, with up to 200 characters per line. I can read and write the file using fstream, but I would like to be able to navigate forwards and backwards through it. Is this possible?
Thanks
Richard
Yes, it is possible, though it's not the best use of resources, and it is damned slow due to physical I/O issues.

Memory mapping your file is probably the route you want to take.

Ooops! Forgot the link!

https://bertvandenbroucke.netlify.app/2019/12/08/memory-mapping-files/
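The article covers the details, but as a rough sketch of the idea (POSIX only, so Linux/macOS; Windows would use CreateFileMapping/MapViewOfFile instead, and the file name big.txt is made up for illustration), mapping a file and scanning it like an ordinary array looks something like this:

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>
#include <cstdio>

int main()
{
    const char* path = "big.txt";               // hypothetical file name
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    // Map the whole file read-only; the OS pages it in on demand,
    // so you never explicitly "read" it.
    char* data = static_cast<char*>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    // The mapping behaves like an in-memory array: here, count the lines.
    std::size_t lines = 0;
    for (off_t i = 0; i < st.st_size; ++i)
        if (data[i] == '\n') ++lines;
    std::printf("%zu lines\n", lines);

    munmap(data, st.st_size);
    close(fd);
}
```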
And "large" is in the "eye of the programmer."

Your text file is approximately 8MB.* That should easily fit completely into memory, heap memory specifically.

A std::vector that stores std::strings would be a first-choice possibility.

Or a std::vector that stores the individual characters "flat."

Or a single std::string.

Choices can be refined by knowing how you want to search and display your data.

*Ickers, maths was never my strong suit in school; I thought I saw 80,000 lines.
That's only about 8,000,000 chars (plus possibly a little more for line terminators), which isn't a particularly large file in the scheme of things. Why not just read the whole file into a std::string, navigate/manipulate it as required, and then re-write the whole file if needed?

Or if you want each line split, read into a std::vector<std::string>
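For example (a minimal sketch; the file name big.txt is made up), reading line-by-line into a vector makes stepping forward and backward from any line trivial index arithmetic:

```cpp
#include <cstddef>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    std::ifstream in("big.txt");               // hypothetical file name
    std::vector<std::string> lines;
    for (std::string line; std::getline(in, line); )
        lines.push_back(line);

    // "Navigation" is now just indexing: jump to a line,
    // then step forwards or backwards at will.
    std::size_t pos = 1000;                    // current line (0-based)
    if (pos < lines.size())      std::cout << lines[pos]     << '\n'; // this line
    if (pos + 1 < lines.size())  std::cout << lines[pos + 1] << '\n'; // step forward
    if (pos > 0)                 std::cout << lines[pos - 1] << '\n'; // step back
}
```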

What are you trying to do?
OK - thanks for the reply. The file in question has the potential to grow to larger than 250,000 lines. I want to be able to step forward and backward from a specific line. I'll give some of your suggestions a try.
Richard
Even 500K lines at 200 chars per line is still approximately 100MB. That is still rather smallish.
Even if the file grows to 500,000 lines, that's only about 100,000,000 bytes (100MB). That's not what I'd consider to be a 'large' file. You should easily be able to read that into memory.
@K9WG,

Be glad you don't need to worry about accessing a file across multiple processes. :)

The idea that 100MB is a large amount of memory is so MS-DOS 3.1-ish.

"640K Ought to be Enough for Anyone" -- alleged Bill Gates quote.
@K9WG,

I don't know if you have learned about stack vs. heap memory yet, but it is something you do need to understand.

C++ Stack vs Heap | Top 8 Differences You Should Know
https://www.educba.com/c-stack-vs-heap/

The Heap is also known as the Free Store.

FYI, that linked article isn't really about C++, it is more about C. C++ has additional support for allocating/deallocating heap memory using new and delete.
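Just to illustrate the distinction (a sketch of my own, not from the linked article):

```cpp
#include <string>
#include <vector>

int main()
{
    int onStack = 42;              // automatic (stack) storage: freed when scope ends

    int* onHeap = new int(42);     // raw heap allocation: you must free it yourself
    delete onHeap;

    // In modern C++ you rarely call new/delete directly: containers such as
    // std::vector and std::string allocate their storage on the heap for you
    // and release it automatically.
    std::vector<std::string> lines(40'000, std::string(200, 'x')); // ~8MB on the heap
}   // lines' destructor releases that memory here
```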
OK - I have come a long way since my question. I am looking at two solutions.

1. Using linked list
2. Using vectors

I have played around with both and they seem to work for my application. What, in your opinion, would be the best approach? My main concern would be possible memory issues.

thanks

Richard - K9WG
When in doubt, use std::vector

Prefer using STL vector by default unless you have a reason to use a different container

https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Rsl-vector

My recommendation is to use std::vector by default. More generally, use a contiguous representation unless there is a good reason not to.

https://www.stroustrup.com/bs_faq2.html
What are you doing with the data once read? Usually you'd use a std::vector. However, if you have, say, a requirement to add/delete many elements that aren't at the end, then perhaps a std::vector may not be the best choice on performance grounds. On the other hand, directly accessing individual elements of a list can be very expensive...

Once you know the frequent operations you intend to do with the data once read, then see which data structure best suits for a) ease of performing the operation and b) performance. But before moving away from std::vector, have good reasons to.

And yes, a list uses more memory than a std::vector because of the extra memory needed for the links.

There's also std::deque if you need to add/delete from both the front and the back whilst also having the ability to access individual elements directly:
https://cplusplus.com/reference/deque/deque/
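A quick sketch of what that buys you (the line contents here are just placeholders):

```cpp
#include <deque>
#include <iostream>
#include <string>

int main()
{
    std::deque<std::string> lines;
    lines.push_back("line 2");
    lines.push_back("line 3");
    lines.push_front("line 1");    // cheap at the front, unlike std::vector

    std::cout << lines[1] << '\n'; // still supports direct indexing: "line 2"

    lines.pop_front();             // cheap removal from either end
    lines.pop_back();
}
```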

This is one of those questions where you need some sort of explanation as to what you really want to do.
Skipping lines? OK, you can do vector<string>, or a list if you need to do that a lot. If this is a case of read it once, search it a bunch, that could make sense. But then questions like "do you need to delete a line, insert a line, etc.?" come into play. Adding a line to the middle of a vector<string> is painful; adding a line to the middle of a linked list is less so. But linked lists are difficult to search and tend to scatter their memory to the four winds, making them sluggish.

So it's a combination of "do you know the strengths and weaknesses of the standard containers" and "what do you want to do most of the time with this wad of data".

If it's some really high-performance task, you can string containers together or build your own as well. Nothing says best of both worlds like a small linked-list class coded using a vector as its memory storage and indices as the "pointers". Then you can efficiently iterate over all the lines, or jump around / insert / delete lines on the cheap, and so on. But C++ does not provide that one; you would have to cobble it together yourself, as in the sketch below.
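Something along these lines (a hypothetical minimal sketch of the idea; a real version would also want erase, reserve, etc.):

```cpp
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Doubly linked list whose nodes live in one std::vector; indices replace
// pointers, so the node storage stays contiguous and cache-friendly.
class LineList {
    static constexpr std::size_t npos = static_cast<std::size_t>(-1);
    struct Node {
        std::string text;
        std::size_t prev = npos;
        std::size_t next = npos;
    };
    std::vector<Node> nodes;                  // backing storage, grows like a vector
    std::size_t head = npos, tail = npos;

public:
    // Append a line; amortised O(1), nodes stay packed in memory.
    std::size_t push_back(std::string text) {
        std::size_t idx = nodes.size();
        nodes.push_back({std::move(text), tail, npos});
        if (tail != npos) nodes[tail].next = idx; else head = idx;
        tail = idx;
        return idx;
    }

    // Insert after an existing node; O(1), no element shifting as a
    // plain vector insert would need.
    std::size_t insert_after(std::size_t pos, std::string text) {
        std::size_t idx = nodes.size();
        nodes.push_back({std::move(text), pos, nodes[pos].next});
        if (nodes[pos].next != npos) nodes[nodes[pos].next].prev = idx;
        else tail = idx;
        nodes[pos].next = idx;
        return idx;
    }

    // Walk the list in logical order (which may differ from vector order
    // once inserts have happened).
    void print() const {
        for (std::size_t i = head; i != npos; i = nodes[i].next)
            std::cout << nodes[i].text << '\n';
    }
};

int main()
{
    LineList lines;
    std::size_t first = lines.push_back("line 1");
    lines.push_back("line 3");
    lines.insert_after(first, "line 2");      // O(1) insert in the middle
    lines.print();                            // line 1 / line 2 / line 3
}
```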

Even 5 years ago I was cranking through 4+ GB XML files, loaded as a single block and searched as such, with no problems on a low-end work-provided laptop. Today my desktop has 64 GB of RAM, so loading a 32GB file into it as one block is likely possible if nothing much else is running. And searching it? Well, I have 20 cores now; each one can tackle a 2GB or so chunk in a thread and still leave a core for the OS to do whatever it does. High end, sure, but it's also a home PC and not even touching the capability of a commercial system. Your task here may seem large, but it is not (if targeting PC-sized platforms). It may be sluggish if you do it wrong, but it won't be if you take care.
> I have played around with both and they seem to work for my application.

Keep it simple then; don't overthink, don't attempt any premature optimisation; opt for the simplest solution that works for you. In general, strongly favour programmer efficiency over machine efficiency.
K9WG wrote:
OK - I have come a long way since my question. I am looking at two solutions.

1. Using linked list
2. Using vectors

Which C++ container you choose (there are more than those two available) depends in large part on how you plan to manipulate and massage the stored data. How you intend to use the data can also be a consideration in how it is stored in the container.

The pros and cons of each C++ container is something you should research:
https://en.cppreference.com/w/cpp/container

There are two basic types of C++ containers: Sequence and Associative. There's a sub-category of Unordered Associative containers, and a handful of Container adapters.

If none of the current containers fits well enough to do what you want without a lot of programmatic hoop-jumping, you can always cruft an adapter of your own.

I do suggest continuing what you are currently doing: trying out several containers and seeing if you can use what is already available.
For files full of lines I tend to use std::deque<std::string> over std::vector<std::string>. Inserting and deleting lines at either end becomes a much less expensive operation for large numbers of lines, and a deque is friendlier to the allocator because it doesn't need one huge contiguous block of memory.
Neat, it's always interesting to hear cases where people don't just use std::vector.
OK - After a lot of experimenting I have decided on deque to handle my text files. Thanks for all the input and expertise.

Richard - K9WG