Say I have a large txt or CSV file with data I want to search. And say I have several files.
What is the best way to index and make this data searchable? I’ve been using grep, but it is not ideal.
Is there a self-hostable Docker container for indexing and searching this? Or should I use SQL instead?
You can import CSV files directly into an SQLite database. Then you are free to do whatever SQL searches or manipulations you want. Python and Go both have great SQLite libraries from personal experience, but I’d be surprised if there is any language that doesn’t have a decent one.
If you are running Linux, most distros have an SQLite GUI browser in the repos that is pretty powerful.
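The CSV import described above could be sketched roughly like this in Python, using only the standard library. The file name, table name, column names and the import_csv helper are made up for illustration, and the sketch assumes the CSV has a header row and every row has the same number of fields.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def import_csv(conn, csv_path, table):
    """Load a CSV file (with a header row) into a new SQLite table."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        # Quote identifiers so unusual column names do not break the SQL.
        cols = ", ".join('"' + c.replace('"', '""') + '"' for c in header)
        marks = ", ".join("?" for _ in header)
        tbl = '"' + table.replace('"', '""') + '"'
        conn.execute(f"CREATE TABLE {tbl} ({cols})")
        conn.executemany(f"INSERT INTO {tbl} VALUES ({marks})", reader)
    conn.commit()


# Hypothetical sample file standing in for one of the real CSVs.
csv_path = Path(tempfile.mkdtemp()) / "people.csv"
csv_path.write_text(
    "name,city\nAda,London\nLinus,Helsinki\nGrace,New York\n",
    encoding="utf-8",
)

conn = sqlite3.connect(":memory:")  # use a file path instead to persist
import_csv(conn, csv_path, "people")

# An index on the searched column speeds up exact-match lookups.
conn.execute("CREATE INDEX idx_people_city ON people (city)")

rows = conn.execute(
    "SELECT name FROM people WHERE city = ?", ("Helsinki",)
).fetchall()
print(rows)
```

With several files, the same helper can be called once per file, either into separate tables or (with a small change) into one shared table.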
I’d be surprised if there is any language that doesn’t have a decent one.
Yeah, SQLite provides a library implemented in C. Because C doesn’t require a runtime, it’s possible for other languages to call into this C library. All you need is a relatively thin wrapper library, which provides an API that feels good to use in the respective language.
Import it into Access (Satan whispers quietly into your ear)
Excel / OnlyOffice?
I love self-hosted tools, but you can do a lot on a spreadsheet.
Btw, if the files are not too large, you can run SQL-style queries on them without even hosting a database, just by using pandas. This avoids the problem of updating entries and handling migrations in case the CSVs change over time.
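A minimal sketch of the pandas approach, assuming pandas is installed. The sample data and column names are made up, and the filter is pandas' equivalent of a SQL WHERE clause rather than literal SQL.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice, pass a file path to read_csv.
csv_text = (
    "name,city,age\n"
    "Ada,London,36\n"
    "Linus,Helsinki,28\n"
    "Grace,New York,45\n"
)
df = pd.read_csv(io.StringIO(csv_text))

# Roughly: SELECT name FROM df WHERE age > 30 AND city LIKE '%on%'
hits = df[(df["age"] > 30) & df["city"].str.contains("on", regex=False)]
names = hits["name"].tolist()
print(names)
```

Because the file is re-read on every run, there is nothing to migrate when the CSV changes. The trade-off is that the whole file has to fit in memory.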
Could use Polars. Afaik it supports streaming from CSVs too, and frankly the syntax is so much nicer than pandas if you’re coming from Spark land.
Do you need to persist? What are you doing with them? A really common pattern for analytics is landing the files in a format such as Parquet or Delta (less frequently Avro or ORC) and then working right off that. If the files don’t change, it’s an option. Writing 100 gigs of CSVs to a database will take some time, depending on resources, tools and DB flavour. To be fair, writing into a compressed format takes time too, but it saves you managing a database (unless you want to; I’m just presenting some alternatives).
Could look at a document DB. Again, it will take time to ingest and index, but it’s definitely another tool. I’ve touched Elastic and stood up Mongo before. Solr is also around and is built on top of Lucene, which I knew Elastic was. Mongo’s core isn’t, though apparently its Atlas Search feature is.
Edit: searchable? I’d look into a document DB. It’s quite literally what they’re meant for, and all of the ones I mentioned are used for enterprise search.
Importing that data into an RDBMS would be ideal. I’d use PostgreSQL for this, but any other would work.
Why bother setting up a hosted DB server when you can get all of the RDBMS optimizations in an in-process service? DuckDB is pretty cool